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Preface 


In 2013, we edited the first edition of this book to provide an overview of chal- 
lenges and developments in digital image forensics. This was the first book on the 
topic providing a comprehensive review of prominent research themes as well as 
the perspectives of legal professionals and forensic practitioners. Since then, the 
research in this field has expanded dramatically. On one hand, the volume of multi- 
media content generated and exchanged over the Internet increased hugely. Images 
and videos represent an ever-increasing share of the data traveling on the net and the 
preferred communications means for most users, especially of the younger genera- 
tions. Therefore, they are also the preferred target of malicious attacks. On the other 
hand, with the advent of deep learning, attacking multimedia is becoming easier and 
easier. Ten years ago, manipulating an image in a credible way required uncommon 
technical skills and significant efforts. Manipulating a video, beyond trivial frame- 
level operations, was possible only for movie productions. This scenario has radically 
changed now, and multimedia manipulation is at anyone’s reach. Sophisticated tools 
based on autoencoders or generative adversarial networks are freely available, as 
well as huge datasets to train them at will. Deepfakes have become commonplace 
and represent a real and growing menace for the integrity of information, with reper- 
cussions on various sectors of society, from the economy to journalism to politics. 
And these problems can only worsen as new and increasingly sophisticated forms of 
manipulation are taking shape. 

Unsurprisingly, multimedia manipulation and forensics are becoming hot topics, 
not just for the research community but for society at large. Several no-profit orga- 
nizations have been created to investigate these phenomena. Major IT companies, 
like Google and Facebook, published online large datasets of manipulated videos 
to foster research on these topics. Funding agencies are promoting large research 
projects. Especially remarkable is the Media Forensic initiative (MediFor) launched 
by DARPA in 2016 to support leading research groups on media integrity all over 
the world, which generated important outcomes in terms of methods and reference 
datasets. This was followed in 2020 by the Semantics Forensics initiative (SemaFor) 
aiming at semantic-level detectors of fake media, working jointly on text, audio, 
images, and video to counter next-generation attacks. 
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The greatly increased focus on multimedia integrity has caused an intense research 
effort in latest years, which involves not only the multimedia forensics community 
but the much larger field of computer vision. Indeed, deep learning takes the lion’s 
share not only on the attacker’s side but also on the defender’s side. It provides the 
analyst with new powerful forensic tools for detection. Given a suitably large training 
set, deep learning architectures ensure a significant performance gain with respect to 
conventional methods and much higher robustness to post-processing and evasions 
techniques. In addition, deep learning can be used to address other fundamental 
problems, such as source identification. In fact, being able to track the provenance 
of a multimedia asset helps establishing its integrity and, as sensors become more 
sophisticated and software-oriented, source identification tools must follow suit. 
Likewise, neural networks, especially generative adversarial networks, are natural 
candidates to perform precious counter-forensics tasks. In synthesis, deep learning 
is part of the problem but also part of the solution, and the research activity going on 
in this cat-and-mouse game seems to be continuously growing. 

All this said, we felt this was the right time to create an updated version of that 
book, with largely renovated content, to cover the rapid advancements in multimedia 
forensics and provide an updated review of the most promising methods and state- 
of-the-art techniques. 

We organized the book into three main parts following a brief introduction of 
development in imaging technologies. The first part will provide an overview of 
key findings related to the attribution problem. The work pertaining to the integrity 
of images and videos is collated in the second part. The last part features work on 
combatting adversarial techniques that rise as a significant challenge to forensics 
methods. 
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Notation 


We use the following notational conventions. 


Concerning random variables: 


Random variables by uppercase roman letters, e.g., X, Y. 

Particular realizations of a random variable by corresponding lowercase letters, 
e.g., X, y. 

A sequence of  r.v.s and its particular realizations will be, respectively, denoted 
by X” and x”. 

Predicted random variables by the hat sign, e.g., X. 

The alphabet of a random variable will be indicated by the capital calligraphic 
letter corresponding to the r.v. X. 


Concerning matrices and vectors: 


Matrices by boldface capital letters, e.g., A. 

The entry in the i-th row and j-th column of a matrix A by AG, j). 
Column vectors by boldface lowercase letters, e.g., x. 

< v, w >: scalar product between vectors v and w. 

Operators are set in normal font, for example, the transpose AT. 


Concerning sets: 


Set names by uppercase calligraphic letters, i.e., A-Z. 
Sets will be specified by listing their objects within curly brackets, i.e., {... }. 
Standard number sets by Blackboard bold typeface, e.g., N, Z, R, etc. 


Concerning functions: 


Functions are typeset as a single lowercase letter, in math font, e.g., f(A). 

Multi-letter functions, and in particularly commonly known functions, are set 

! og ° N P. 

in lowercase normal font, e.g., cos(x), corr(x, y), or j* = argmin > ;_, A(i, j). 
J 


xi 


xii 


Notation 


— To deal with specific functions from a field of functions (e.g., spherical 
harmonics basis functions that are indexed by two parameters), we could use 
comma-separated subscripts, e.g., h; ; (x). 

— Non-math annotations are set in normal font, e.g., fspecia (A). The purpose is 
to disambiguate names from variables, like in the example pesa (A). 


Concerning images and videos: 


— I will be exclusively used to refer to an Image or a sequence of images, and T, 
with n € N for multiple images/videos. 

— I(i, j) will be used as a row-major pixel-wise image reference denoting the 
pixel value at row i, column j of I. 

— Similarly, I (Gi, j, t) denotes the value of the (i, j)-th pixel of t-th frame of 
video I. 

— Image pair-wise adjacency matrices will be denoted by M for single matrix 
and M, with n € N for multiple matrices. 

— Row-major adjacency matrix element reference M (i, j) € M for the element 
expressing the relation between the i-th and the j-th images of interest. 


Concerning graphs: 


— A generic graph by G for single and G, with n € N for multiple graphs. 

— G = (V, E) to express a graph’s set of vertices V and set of edges E. 

— Vertex names by v, € V, with z € N. 

— Edge names by e, € E, with n € N; E = (u;, vj) to detail an edge’s pair of 
vertices; if the edge is directed, u; precedes u; in such relation. 


In the context of machine learning and deep learning: 


— A generic dataset by D. 

— Training, validation, and test datasets, respectively, by D,,, Dy, Dis- 

— i—th sample at the input of a DNN and its corresponding label (if indicated) 
by (xi, yi). 

— Loss function used for DNN training by £. 

— Sigmoid activation function o (). 

— Output of a DNN by ó (x) when the input is equal to x. 

— Gradient of function ¢ by Vo. 

— The Jacobian matrix of a multivalued function ¢ by Jọ. 

— The logit value (output of a CNN prior to the final softmax layer) at node i by 
Zi. 

— Mathematical calligraphic capital letters in function form M(-) represent a 
model function (i.e., a Neural Network). 


Concerning norms: 


— L1, L2, squared L2, and L-infinity norms, respectively, by Li, Lo, i, Lo 
— p norm by P.P,. 


PartI 
Present and Challenges 


Chapter 1 A) 
What’s in This Book and Why? geat 


Husrev Taha Sencar, Luisa Verdoliva, and Nasir Memon 


1.1 Introduction 


Multimedia forensics is a societally important and technically challenging research 
area that will need significant effort for the foreseeable future. While the research 
community is growing and work like that in this book demonstrates significant 
progress, many challenges remain. We expect that forensically tackling ever- 
improving media acquisition and generation methods and countering the pace of 
change in media manipulation will continue to require significant technical break- 
throughs in defensive technologies. 

Whether performed at the individual level or class level attribution is at the heart 
of multimedia forensics. This form of passive forensic analysis exploits traces intro- 
duced by the acquisition or generation pipeline and subsequent editing. These traces 
are often subtle and invisible to humans but can be detected currently by analysis 
algorithms and provide mechanisms by which generation or manipulation can be 
automatically detected. Despite significant achievements in this area of research, 
advances in imaging technologies and newly introduced approaches to media syn- 
thesis have a strong potential to render existing forensic capabilities ineffective. 

More critically, media manipulation has now become a pressing problem with 
broad implications. In the past, it required significant skill to create compelling 
manipulations because editing tools, such as Adobe Photoshop, required experienced 
users to alter images convincingly. Over the last several years, the rise of machine 
learning-based technologies has dramatically lowered the skill necessary to create 
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compelling manipulations. For example, Generative Adversarial Networks (GANs) 
can create photo-realistic faces with no skill by an end-user other than the ability to 
refresh a web page. Deepfake algorithms, such as autoencoders to swap faces in a 
video, can create manipulations much more easily than previous generation video 
tools. While these machine learning techniques have many positive uses, they have 
also been misused for darker purposes such as to perpetrate fraud, to create false 
personas, and to attack personal reputations. 

The asymmetry in the operational setting of digital forensics where attackers have 
access to details and inner workings of the involved methods and tools and thereby 
have the freedom to tailor their actions makes the task even more difficult. It won’t 
be surprising to see these advanced capabilities being deployed to counter forensic 
methods. Clearly, these are all pressing and important problems for which technical 
solutions must play a role. 

There are a number of significant challenges that must be addressed by forensics 
community. Imaging technologies are constantly being improved by manufacturers at 
both hardware and software levels with little public information about their specifics. 
This implies a black-box access to these technologies and involves a significant 
reverse engineering effort on the part of researchers. With rapid advancements in 
computational imaging, this task will only get more strenuous. 

Further, new manipulation techniques are becoming available at a rapid pace 
and each generation improves on the previous. As a result, defenders must develop 
ways to limit the a priori knowledge and training samples required from an attack 
algorithm and should assume an open world scenario where they might not have 
knowledge of all the attack algorithms. Meta-learning approaches for developing 
detection algorithms more quickly would also be highly beneficial. Manipulations 
in the wild are propagated via noisy channels, like social media, and so there is a 
need for robust detection technologies that are not fooled by launderings such as 
simple compression or more sophisticated replay attacks. Finally, part of the task 
of forensics is unwinding how the media was manipulated and who manipulated it. 
Consequently, approaches to understanding the provenance of manipulated media 
are necessary. 

All of these challenges motivated us to prepare this book on multimedia foren- 
sics. Our main objective was to discuss new research directions and also present 
some possible solutions to existing problems. In this regard, this book provides a 
timely vehicle for showcasing important research that has emerged in the last years 
addressing media attribution, authenticity verification, and counter-forensics. We 
hope our content also sparks new research efforts in the fight against forgeries and 
misinformation. 


12 Overviews 


Our book begins with an overview of the challenges posed by media manipulation 
as seen from today. In their chapter, Hendrix and Morozoff provide a comprehen- 
sive view of all aspects of the problem to better emphasize the increasing need for 
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advanced media forensics capabilities. This is followed by an examination of foren- 
sics challenges that arise from the increased adoption of computational imaging. The 
chapter by McCloskey assesses the ability to detect automated focus manipulations 
performed by computational cameras. For this, it looks at cues that can be used to 
distinguish optical blur from synthetic blur. The chapter also covers computational 
imaging research with potential future implications on forensics. 

This chapter is followed by a sertes of chapters focusing on different dimensions 
of the attribution problem. Undoubtedly, one of the breakthrough discoveries in 
multimedia forensics is that Photo-Response Non-Uniformity (PRNU) of an imaging 
sensor, which manifests as a unique and permanent pattern introduced to all media 
captured by the sensor, can be used for identification and verification of the source of 
digital media. Our book has three chapters covering different aspects PRNU-based 
source camera attribution. The first chapter by Kirchner reviews the rich body of 
literature on camera identification from sensor noise fingerprints with an emphasis 
on still images from digital cameras and the evolving challenges in this domain. The 
second chapter by Sencar examines extension of these capabilities to video domain 
and outlines recently developed methods for estimating the sensor’s PRNU from 
videos. The third chapter on this topic by Taspinar et al. provides an overview of 
techniques proposed to perform source matching efficiently in the presence of a large 
collection of media. 

Another vertical in attribution domain is the source camera model identification 
problem. Assuming that a picture or video has been digitally acquired with a camera, 
the goal of this research direction is to identify the brand and model of the device 
used at acquisition time without relying on metadata. The chapter by Mandelli et al. 
investigates source camera model identification through pixel analysis and discusses 
wide series of methodologies proposed to solve this problem. 

The remarkable progress of deep learning, in particular Generative Adversarial 
Networks (GANS), has led to the generation of extremely realistic fake facial content. 
The potential for misuse of such capabilities opens up new problem in forensics that 
focuses on identification of GAN fingerprints used in face image synthesis. Neves 
et al. provide an in-depth literature analysis of state-of-the-art detection approaches 
for face synthesis and manipulation as well as spoofing those detectors. 

The next part of our book involves several chapters focusing on integrity and 
authenticity of multimedia. This part starts with a review of physics-based methods 
that mainly analyzes the interaction of light and objects and the geometric mapping of 
light and objects onto the image sensor. In his chapter, Reese reviews the major lines 
of research on physics-based methods and discusses their strengths and limitations. 

The following chapter investigates forensic applications of the Electric Network 
Frequency (ENF) signal. Since ENF serves as an environmental signature captured 
by audio and video recordings made in locations where there is electrical activity, it 
provides an additional dimension for time—location authentication and for inferring 
the grid in which a recording was made. The chapter by Hajj-Ahmad et al. provides 
an overview of the increasing amount of research work that has been done in this 
field. 
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The long-lasting problem of image and video manipulation is revisited in the sub- 
sequent four chapters. However, distinct from conventional manipulation detection 
methods, Cozzolino et al. focus on the data-driven deep learning-based methods pro- 
posed in recent years. This chapter discusses in detail forensics traces these methods 
rely on, and describes architectural solutions used for detection and localization, 
together with the associated training strategies. The next chapter features the new 
class of forgery known as DeepFakes that involve impersonating audios and videos 
generated by deep neural networks. In this chapter, Lyu surveys the state-of-the-art 
DeepFake detection methods and evaluates solutions proposed to tackle this problem 
along with their pros and cons. In the third chapter of the topic, Long et al. provide an 
overview of the work related to video frame deletion and duplication and introduce 
deep learning-based approaches to tackle these two types of manipulations. 

A complementary integrity verification approach is presented next. Several work 
demonstrated that media manipulation not only leaves forensic traces in the audio- 
visual content but also in the file structure. The chapter authored by Piva covers 
proposed forensic methods devoted to the analysis of image and video file formats 
to validate multimedia integrity. 

The last chapter on authenticity verification introduces the problem of provenance 
analysis. By jointly examining multiple multimedia files, instead of making individ- 
ual assessments, and evaluating pairwise relationships between them, the prove- 
nance of multimedia files is more reliably determined. The chapter by Moreira et al. 
addresses this problem by covering state-of-the-art analysis techniques. 

The last part of our book includes two chapters on counter-forensics comple- 
menting the advances brought by deep learning to forensics. The advances brought 
by deep learning have also significantly expanded the capabilities of anti-forensic 
attackers. The first chapter by Barni et al. focuses on adversarial examples in image 
forensics and how they can be used in creation of attacks. It also describes possible 
countermeasures against them, and discusses their effectiveness. The other chapter by 
Stamm et al. focuses on emerging threat posed by GAN-based anti-forensic attacks 
in creating realistic, but completely synthetic forensic traces. 
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Chapter 2 A) 
Media Forensics in the Age get 
of Disinformation 


Justin Hendrix and Dan Morozoff 


2.1 Media and the Human Experience 


Empiricism is the notion that knowledge originates from sensory experience. Implicit 
in this statement is the idea that we can trust our senses. But in today’s world, 
much of the human experience is mediated through digital technologies. Our sensory 
experiences can no longer be trusted a priori. The evidence before us—what we see 
and hear and read—is, more often than not, manipulated. 

This presents a profound challenge to a species that for most of its existence 
has relied almost entirely on sensory experience to make decisions. A group of 
researchers from multiple universities and multiple disciplines that study collective 
behavior recently assessed that the study of the impact of digital communications 
technologies on humanity should be understood as a “crisis discipline,” alongside the 
study of matters such as climate change and medicine (Bak-Coleman et al. 2021). 
Noting that since “our societies are increasingly instantiated in digital form,” they 
argue the potential of digital media to perturb the ways in which we organize our- 
selves, deal with conflict, and solve problems is profound: 


Our social adaptations evolved in the context of small hunter-gatherer groups solving local 
problems through vocalizations and gestures. Now we face complex global challenges from 
pandemics to climate change—and we communicate on dispersed networks connected by 
digital technologies such as smartphones and social media. 


The evidence, they say, is clear: in this new information environment, misinforma- 
tion, disinformation, and media manipulation pose grave threats—from the spread of 
conspiracy theories (Narayanan et al. 2018), rejection of recommended public health 
interventions (Koltai 2020), subversion of democratic processes (Atlantic Council 
2020), and even genocide (Lubinski 2020). 

Media manipulation is only one of a series of problems in the digital information 
ecosystem—but it is a profound one. For most of modern history, trust and verifiabil- 
ity of media—whether printed material, audio and more recently images, video, and 
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other modalities—have been linked to one’s ability to access and coordinate infor- 
mation to previously gathered facts and beliefs. Consequently, in order to deceive 
someone, the deceiver had to rely on the obfuscation of a media object’s salient fea- 
tures and hope that the consumer’s time and ability to perceive these differences was 
not sufficient. This generalization extends from boasts and lies told in the school- 
yard, to false narratives circulated by media outlets, to large-scale disinformation 
campaigns pursued by nations during military conflicts. 

A consistent property and limitation of deceptive practices is that their success 
is linked to two conditions. The first condition is that the ability to detect false 
information is linked to the amount of time we have to process and verify it before 
taking action on it. For instance, showing someone a brief picture of a photoshopped 
face is more likely to be taken as true if the viewer is given less time to examine 
it. The second condition is linked to the amount of time and effort required by the 
deceiver to create the manipulation. Hence, the impact and reach of a manipulation 
is limited by the resources and effort involved. A kid in school can deceive their 
friends with a photoshopped image, but it required Napoleon and the French Empire 
to flood the United Kingdom with fake pound notes during the Napoleonic wars and 
cause a currency crisis. 

A set of technological advancements over the last 30 years has led to a tectonic 
shift in this dynamic. With the adoption of Internet and social media in the 1990s and 
2000s, respectively, the marginal costs required to deceive a victim have plummeted. 
Any person on the planet with a modicum of financial capital is able to reach billions 
of people with their message, effectively breaking the resource dependence of a 
deceiver. This condition is an unprecedented one in the history of the world. 

Furthermore, the maturation and spread of machine learning-powered media tech- 
nologies has provided the ability to generate diverse media at scale, lowering the 
barriers to entry for all content creators. Today, the amount of work required to 
deliver mass content at Internet scale is exponentially decreasing. This is inevitably 
compounding the problem of trust online. 

As a result, our society is approaching a precipice of far-reaching consequences. 
Media manipulation is easy for anyone, even if they are without significant resources 
or technical skills. While conversely the amount of media available to people contin- 
ues to explode, the amount of time dedicated to verifying a single piece of content is 
also falling. As such, if we are to have any hope of adapting our human sensemaking 
to this digital age, we will need media forensics incorporated into our digital infor- 
mation environment in multiple layers. We will require automated tools to augment 
our human abilities to make sense of what we read, hear, and see. 


2.2 The Threat to Democracy 


One of the most pressing consequences of media manipulation is the threat to democ- 
racy. Concern over issues of trust and the role of information in democracies is severe. 
Globally, democracy has run into a difficult patch, with 2020 perhaps among its worst 
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years in decades. The Economist Intelligence Unit’s Democracy Index found that as 
of 2020 only “8.4% of the world’s population live in a full democracy while more 
than a third live under authoritarian rule” (The Economist 2021). In the United States, 
the twin challenges of election disinformation and the fraught public discourse on 
the COVID-19 pandemic exposed weaknesses in the information ecosystem. 

Technology—once regarded as a boon to democracy—is now regarded by some 
as a threat to it. For instance, a 2019 survey of 1000 experts 2 by the Pew Research 
Center and Elon University’s Imagining the Internet Center found that, when pressed 
on the impact of the use of technology by citizens, civil society groups, and gov- 
ernments on core aspects of democracy and democratic representation, 49% of the 
“technology innovators, developers, business and policy leaders, researchers, and 
activists” surveyed said technology would “mostly weaken core aspects of democ- 
racy and democratic representation in the next decade,” while 33% said technology 
would mostly strengthen democracies. 

Perhaps the reason for such pessimism is that current interventions to manage trust 
in the information ecosystem are not taking place against a static backdrop. Social 
media has accompanied a decline of trust in institutions and the media generally, 
exacerbating a seemingly intractable set of social problems that create ripe conditions 
for the proliferation of disinformation and media manipulation. 

Yet, the decade ahead will see even more profound disruptions to the ways in which 
information is produced, distributed, sought, and consumed. A number of techno- 
logical developments are driving the disruption, including the installation of 5G 
wireless networks, which will enable richer media experiences, such as AR/VR and 
having more devices trading information more frequently. The increasing advance 
of various computational techniques to generate, manipulate, and target content is 
also problematic. The pace of technological developments suggest that, by the end 
of the decade, synthetic media may radically change the information ecosystem and 
demand new architectures and systems to aid sensemaking. 

Consider a handful of recent developments: 


e Natural text to speech enables technologies like Google Assistant, but is also being 
used for voice phishing scams. 

e Fully generated realistic images from StyleGAN2—a neural network model—are 
being leveraged to make fake profiles in order to deceive on social media. 

e Deep Fake face swapping videos are now capable of fooling many people who do 
not pay close attention to the video. 


It is natural, in this environment, that the study of the threat itself is a thriving 
area of inquiry. Experts are busy compiling threats, theorizing threat models, study- 
ing the susceptibility of people to different types of manipulations, exploring what 
interventions build resilience to deception, helping to inform the debate about what 
laws and rules need to be in place, and offering input on the policies technology 
platforms should have in place to protect users from media manipulation. 

In this chapter, we will focus on the emerging threat landscape, in order to give the 
reader a sense of the importance of media forensics as a discipline. While the technical 
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work of discerning text, images, video, and audio and their myriad combinations may 
not seem to qualify as a “crisis discipline,” we believe that it is indeed. 


2.3 New Technologies, New Threats 


In this section, we describe selected research trends with high potential impact on the 
threat landscape of synthetic media generation. We hope to ground the reader in the art 
of the possible at the cutting edge of research and provide some intuition to where the 
field is headed. We briefly explain the key contributions and why we believe they are 
significant for students and practitioners of media forensics. This list is not exhaustive. 

These computational techniques will underpin synthetic media applications in the 
future. 


2.3.1 End-to-End Trainable Speech Synthesis 


Most speech generation systems include several processing steps that are customarily 
trained in isolation (e.g., text-to-spectrogram generators, vocoders). Moreover, while 
generative adversarial networks (GANs) have made tremendous progress on visual 
data, their application to human speech still remains limited. A recent paper addresses 
both of these issues with an end-to-end adversarial training protocol (Donahue et al. 
2021). The paper describes a fully differentiable feedforward architecture for text-to- 
speech synthesis, which further reduces the amount of the necessary training-time 
supervision by eliminating the need for linguistic feature alignment. This further 
lowers the barrier for engineering novel speech synthesis systems, simplifying the 
need for data pre-processing to generate speech from textual inputs. 

Another attack vector that has been undergoing rapid progress is voice cloning 
(Hao 2021). Voice cloning can synthetically convert a short audio recording of 
one person’s voice to produce a very similar voice applied in various inauthentic 
capacities. A recent paper proposes a novel speech synthesis system, NAUTILUS, 
which brings text-to-speech and voice cloning under the same framework (Luong 
and Yamagishi 2020). This demonstrates the possibility that a single end-to-end 
model could be used for multiple speech synthesis tasks and could lead to unifica- 
tion recently seen in natural language models (e.g., BERT Devlin et al. 2019 where 
a single architecture beats state of the art in multiple benchmark problems). 


2.3.2 GAN-Based Codecs for Still and Moving Pictures 


Generative adversarial networks (GANSs) have revolutionized image synthesis as they 
rapidly evolved from small-scale toy datasets to high-resolution photorealistic con- 
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tent. They exploit the capacity of deep neural networks to learn complex manifolds 
and enable generative sampling of novel content. While the research started from 
unconditional, domain-specific generators (e.g., human faces), recent work focuses 
on conditional generation. This regime is of particular importance for students and 
practitioners of media forensics as it enables control of the generation process. 

One area where this technology is underappreciated with regard to its misuse 
potential is lossy codecs for still and moving pictures. Visual content abounds with 
inconsequential background content or abstract textures (e.g., grass, water, clouds) 
which can be synthesized in high quality instead of explicit coding. This is particularly 
important in limited bandwidth regimes that typically occur in mobile applications, 
social media, or video conferencing. Examples of recent techniques in this area 
include GAN-based videoconferencing compression from Nvidia (Nvidia Inc 2020) 
and the first high-fidelity GAN-based image codec from Google (Mentzer et al. 2020). 

This may have far-reaching consequences. Firstly, effortless manipulation of video 
conference streams may lead to novel man-in-the-middle attacks and further blur the 
boundary between acceptable and malicious alterations (e.g., while replaced back- 
grounds are often acceptable, the faces or expressions are probably not). Secondly, 
the ability to synthesize an infinite number of content variations (e.g., decode the 
same image with different looking trees) poses a threat to reverse image search—one 
of the very few reliable techniques for fact checkers today. 


2.3.3 Improvements in Image Manipulation 


Progress in machine learning, computer vision, and graphics has been constantly 
lowering the barrier (i.e., skill, time, compute) for convincing image manipulation. 
This threat is well recognized, hence we simply report a novel manipulation technique 
recently developed by University of Washington and Adobe (Zhang et al. 2020). The 
authors propose a method for object removal from photographs that automatically 
detects and removes the object’s shadow. This may be another blow to image forensic 
techniques that look for (in)consistency of physical traces in the photographs. 


2.3.4 Trillion-Param Models 


The historical trend of throwing more compute at a problem to see where the 
performance ceiling is continuous with large-scale language models (Radford and 
Narasimhan 2018; Kaplan et al. 2020). Following on the heels of the much popular- 
ized GPT-3 (Brown et al. 2020), Google Brain recently published their new Switch 
Transformer model (Fedus et al. 2021)—a x 10 parameter increase over GPT-3. 
Interestingly, research trends related to increasing efficiency and reducing costs are 
paralleling the development of these large models. 
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With large sparse models, it is becoming apparent that lower precision (e.g., 16-bit) 
can be effectively used, combined with new chip architecture improvements target- 
ing these lower bit precisions (Wang et al. 2018; Park and Yoo 2020). Furthermore, 
because of the scales of data needed to train these enormous models, sparsification 
techniques are further being explored (Jacobs et al. 1991; Jordan and Jacobs 1993). 
New frameworks are also emerging that facilitate distribution of ML models across 
many machines (e.g., Tensorflow Mesh that was used in Google’s Switch Transform- 
ers) (Shazeer et al. 2018). Finally, an important trend that is also stemming from this 
development is improvements in compression and distillation of models, a concept 
that was explored in our first symposium as a potential attack vector. 


2.3.5 Lottery Tickets and Compression in Generative Models 


Following in the steps of a landmark 2018 paper proposing the neural lottery hypoth- 
esis (Frankle and Carbin 2019), work continues in ways of discovering intelligent 
compression and distillation techniques to prune large models without losing accu- 
racy. A recent paper being presented at ICLR 21 has applied this hypothesis to 
GANs (Chen et al. 2021). There have also been interesting approaches investigat- 
ing improvements in the training dynamics of subnetworks by leveraging differential 
solvers to stabilize GANs (Qin et al. 2020). The Holy Grail around predicting subnet- 
works or initialization weights still remains elusive, and so the trend of training ever 
larger parameter models, and then attempting to compress/distill them, will continue 
until the next ceiling is reached. 


2.4 New Developments in the Private Sector 


To understand how the generative threat landscape is evolving, it is important to 
track developments from industry that could be used in media manipulation. Here, 
we include research papers, libraries for developers, APIs for the general public, 
as well as polished products. The key players include research and development 
divisions of large corporations, such as Google, Nvidia, and Adobe, but we also see 
relevant activities from the startup sector. 


2.4.1 Image and Video 


Nvidia Maxine (Nvidia Inc 2020)—Nvidia released a fully accelerated platform 
software development kit for developers of video conferencing services to build and 
deploy Al-powered features that use state-of-the-art models in their cloud. A notable 
feature includes GAN-based video compression, where human faces are synthesized 
based on facial landmarks and a single reference image. While this significantly 
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improves user experience, particularly for low-bandwidth connections, it opens up 
potential for abuse. 

Smart Portraits and Neural Filters in Photoshop (Nvidia Inc 2021)—Adobe intro- 
duced neural filters into their flagship photo editor, Photoshop. The addition includes 
smart portraits, a machine-learning-based face editing feature which brings the latest 
face synthesis capabilities to the masses. This development continues the trend of 
bringing advanced ML-power image editing to mainstream products. Other notable 
examples of commercially available advanced AI editing tools include Anthropic’s 
PortraitPro (Anthropics Inc 2021) and Skylum’s Luminar (Skylum Inc 2021). 

ImageGPT (Chen et al. 2020) and DALL-E (Ramesh et al. 2021)—Open AI is 
experimenting with extending their GPT-3 language model to other data modalities. 
ImageGPT is an autoregressive image synthesis model capable of convincing com- 


a snail made of tank. a snail with the texture of a tank. 


Fig. 2.1 Synthesized images generated by DALL-E based on a prompt requesting “snail-like tanks” 
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pletion of image prompts. DALL-E is an extension of this idea that combines visual 
and language tokens. It allows for image synthesis based on textual prompts. The 
model shows a remarkable capability in generating plausible novel content based on 
abstract concepts, e.g., “avocado chairs” or “snail tanks,” see Fig. 2.1. 

Runway ML (RunwayML Inc 2021)—Runway ML is a Brooklyn-based startup 
that provides an Al-powered tool for artists/creatives. Their software brings most 
recent open-source ML models to the masses and allows for their adoption in 
image/video editing workflows. 

Imaginaire (Nvidia Inc 2021)—Nvidia released a PyTorch library that contains 
optimized implementation of several image and video synthesis methods. This tool is 
intended for developers and provides various models for supervised and unsupervised 
image-to-image and video-to-video translation. 


2.4.2 Language Models 


GPT-3 (Brown et al. 2020)—OpenAI developed a language model with 175-billion 
parameters. It performs well in few-shot learning scenarios, producing nuanced text 
and functioning code. Shortly after, the model has been released via a Web API for 
general-purpose text-to-text mapping. 

Switch Transformers (Fedus et al. 2021)—Six months after the GPT-3 release, 
Google announced their record-breaking language model featuring over a trillion 
parameters. The model has been described in a detailed technical manuscript, and 
has been published along with a Tensorflow Mesh implementation. 

CLIP (Radford et al. 2021)—Released in early January 2021 by OpenAI, CLIP is 
an impressive zero-shot image classifier. It pre-trained on 400 million text-to-image 
pairs. Given a list of descriptions, CLIP will make a prediction for which descriptor, 
or caption, a given image should be paired. 


2.5 Threats in the Wild 


In this section, we wish to provide a summary of observed disinformation campaigns 
and scenarios that are relevant to students and practitioners of media forensics. We 
would like to underscore new tactics and evolving resources and capabilities of 
adversaries. 


2.5.1 User-Generated Manipulations 


Thus far, user-generated media, aimed at manipulating public opinion, by and large, 
have not made use of sophisticated media manipulation capabilities beyond image, 
video, and sound editing software available to the general public. Multiple politically 
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charged manipulated videos emerged and gained millions of views during the second 
half of 2020. Documented cases related to the US elections include manipulation of 
the visual background of a speaker (Reuters Staff 2021c), adding manipulated audio 
(Weigel 2021), cropping of a footage (Reuters Staff 2021a), manipulated editing of 
a video, and an attempt to spread false claim that an original video is a “deep fake” 
production (Reuters Staff 2021b). 

In all of the cases mentioned, the videos gained millions of views within hours 
on Twitter, YouTube, Facebook, Instagram, TikTok, and other platforms, before they 
were taken down or flagged to emphasize they were manipulated. Often the videos 
emerged from an anonymous account on social media, but were shared widely, in 
part by prominent politicians and campaign accounts on Twitter, Facebook, and 
other platforms. Similar incidents emerged in India (e.g., an edited version of a 
video documenting police brutality) (Times of India Anam Ajmal 2021). 

Today, any user can employ Al-generated images for creating a fake person. As 
recently reported in the New York Times (Hill and White 2021), Generated Photos 
(Generated Photos Inc 2021) can create “unique, worry-free” fake persons for “$2.99 
or 1,000 people for $1000.” Similarly, there’s open advertising on Twitter by Rosebud 
Al for creating human animation (Rosebud AI Inc 2021). 

Takedowns of far-right extremists and conspiracy-related accounts by several 
platforms, including Twitter, Facebook, Instagram, and WhatsApp announcement of 
changes to its privacy policies resulted in reports of migration to alternative platforms 
and chat services. Though it is still too early to assess the long-term impact of this 
change, it has the potential of impacting consumption and distribution of media, 
particularly among audiences that are already keen to proliferate conspiracy theories 
and disinformation. 


2.5.2 Corporate Manipulation Services 


In recent years, investigations by journalists, researchers, and social media platforms 
have documented several information operations that were conducted by state actors 
via marketing and communications firms around the world. States involved included, 
among others, Saudi Arabia, UAE, and Russia (CNN Clarissa Ward et al. 2021). Some 
researchers claim outsourcing information operations has become “the new normal,” 
and so far the operations that were exposed did not spread original synthetic media 
(Grossman and Ramali 2021). 

For those countries seeking the benefits of employing inauthentic content but 
unable to internally develop such a capability, there are plenty of manipulation com- 
panies available for hire. Regimes incapable of building a manipulation capability, 
deprived of the technological resources, and no longer need to worry as an entirely 
new industry of Advanced Persistent Manipulators (APM) (GMFUS Clint Watts 
2021) have emerged in the private sector to meet the demand. Many advertising agen- 
cies, able to make manipulated content with ease, are hired by governments for public 
relations purposes and blur the lines between authentic and inauthentic contents. 
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The use of a third party by state actors tends to be limited to a targeted campaign. 
In 2019, reports by Twitter-linked campaigns run by the Saudi firm Smaat to the 
Saudi government (Twitter Inc 2021). The firm used automated tools to amplify pro- 
Saudi messages, using networks of thousands of accounts. The FBI has accused the 
firm's CEO for being the Saudi government agent in an espionage case involving 
two Saudi former Twitter employees (Paul 2021). 

In 2020, a CNN investigation exposed an outsourced effort by Russia, through 
Ghana and Nigeria to impact U.S. politics. The operation targeted African American 
demographics in the United States using accounts on Facebook, Twitter, and Insta- 
gram. It was mainly manually executed by a dozen troll farm employees in West 
Africa (CNN Clarissa Ward et al. 2021). 


2.5.3 Nation State Manipulation Examples 


For governments seeking to utilize, acquire, or develop synthetic media tools, the 
resources and technology capability needed to create and harness these tools are 
significant, hence the rise of third-party actors is described in Sect. 2.5.2. Below is a 
sampling of recent confirmed or presumed nation state manipulations. 


People’s Republic of China 

Over the last 2 years, Western social media platforms have encountered a size- 
able amount of inauthentic images and video attributed or believed to be stemming 
from China (The Associated Press ERIKA KINETZ 2021; Hatmaker 2021). To 
what degree these attributions are directly connected to the Chinese government is 
unknown, but the incentives align to suggest the Chinese Communist Party (CCP) 
controlling the state, and the country’s information environment creates or condones 
the deployment of these malign media manipulations. 

Australia has made the most public stand against Chinese manipulations. In 
November 2020, Chinese foreign minister and leading CCP Twitter provocateur 
posted a fake image showing an Australian soldier with a bloody knife next to a 
child (BBC Staff 2021). The image sought to inflame tensions in Afghanistan where 
Australian soldiers have recently been found to have killed Afghan prisoners during 
the 2009 to 2013 time period. 

Aside from this highly public incident and Australian condemnation, signs of 
Chinese employment of inauthentic media continue to mount. Throughout the fall 
of 2019 and 2020, Chinese language YouTube news channels proliferated. These 
new programs, created in high volume and distributed on a vast scale, periodically 
mix real information and real people with selectively sprinkled inauthentic content, 
much of which is difficult to assess as an intended manipulation or a form of narrated 
recreation of actual or perceived events. Graphika has reported on these “political 
spam networks” that distribute pro-Chinese propaganda, first targeting Hong Kong 
protests in 2019 (Nimmo et al. 2021), and the more recent addition of English lan- 
guage YouTube videos focusing on US policy in 2020 (Nimmo et al. 2021). Reports 
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Three posts from Facebook and Twitter are direct (though unattributed) quotes from the Chinese government“ 
official announcement of sanctions against 28 former Trump administration officials. The tweet on the right carries 
an image that reads “Anti-China american politicians’ will finaily pay the price for their crazy actions”. 


Fig. 2.2 Screenshots of posts from Facebook and Twitter 


showed the use of inauthentic or misattributed content in both visual and audio forms. 
Accounts used AI-generated account profile images and dubbed voice overs. 

A similar network of inauthentic accounts were spotted immediately following the 
2021 US presidential inauguration. Users with Deep Fake profile photos, which bear 
a notable similarity to Al-generated photos on thispersondoesnotexist.com, tweeted 
verbatim excerpts lifted from the Chinese Ministry of Foreign Affairs, see Fig. 2.2. 
These alleged Chinese accounts were also enmeshed in a publicly discussed cam- 
paign reported by The Guardian as amplifying a debunked ballot video surfacing 
among voter fraud allegations surrounding the U.S. presidential election. 

The accounts’ employment of inauthentic faces, Fig. 2.3, while somewhat sloppy, 
does offer a way to avoid detection as has been seen repeatedly when stolen real 
photos dropped into image or facial recognition software are easily detected. 


Fig. 2.3 Three examples of Deep Fake profile photos from a network of 13 pro-China inauthentic 
accounts on Twitter. Screengrabs of photos directly from Twitter 


France 
Individuals associated with the French military have been linked to a 2020 infor- 
mation operation, aimed at discrediting a similar Russian IO employing trolls to 
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spread disinformation in the Central African Republic and other African countries 
(Graphika and The Stanford Internet Observatory 2021). The French operation used 
fake media to create Facebook accounts. This is an interesting example of tactical 
strategy that employs a fake campaign to fight disinformation. 


Russia/IRA 

The Russian-sponsored Internet Research Agency was found to have created a small 
number of fake Facebook, Twitter, and LinkedIn accounts that were amplifying a 
website posing as an independent news outlet, peacedata.net (Nimmo et al. 2021). 
While IRA troll accounts have been prolific for years, this is the first observed instance 
of their use of AI-generated images. Fake profiles were created to give credibility to 
peacedata.net, who hired unwitting freelance writers to provide left-leaning content 
for audiences, mostly in the US and UK, as part of a calculated, albeit ineffective, 
disruption campaign. 


2.5.4 Use of AI Techniques for Deception 2019-2020 


During the year 2019-2020, at the height of the COVID-19 pandemic, nefarious 
generative AI techniques gained popularity. The following are examples of various 
modalities employed for the purposes of deception and information operations (IO). 

Ultimately, we also observe that commercial developments are rapidly advanc- 
ing the capability for video and image manipulation. While resources required to 
train GANs and other models will remain a prohibitive factor in the near future, 
improvements in image codecs, larger training parameters, lowering precision, and 
compression/distillation create additional threat opportunities. The threat landscape 
will change rapidly in the next 3 to 5 years, requiring media forensics capabilities to 
advance quickly to keep up with the capabilities of adversaries. 


GAN-Generated Images 

The most prevalent documented use so far of AI capabilities in IO has been in utilizing 
available free tools to create fake personas online. Over the past year, the use of GAN 
technology has become increasingly common, in operations linked to state actors, or 
organizations driven by political motivation, see Fig. 2.3. The technology was mostly 
applied in generating profile pictures for inauthentic users on social media, relying on 
open-source available tools such as thispersondoesntexist.com and Generated Photos 
(Generated Photos Inc 2021). 

Several operations used GANs to create hundreds to thousands of accounts that 
spammed platforms with content and comments, in an attempt to amplify messages. 
Some of these operations were attributed to Chinese actors and to PR firms. 

Other operations limited the use of GAN-generated images to create a handful 
of more “believable” “fake personas” that appeared as creators of content. At least 
one of those efforts was attributed to Russia’s IRA. In at least one case, a LinkedIn 
account with a GAN-generated image was used as part of a suspected espionage 
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effort to contact US national security experts (Associated Press 2021). 


Synthetic Audio 

In recent years, there were at least three attempts documented, one of which was 
successful, to defraud companies and individuals using synthetic audio. Two cases 
included use of synthetic audio to impersonate the company CEO in a phone call. 
In 2019, a British energy company transferred over $200,000 following a fraudulent 
phone call, generated with synthetic audio to impersonate the parent company’ CEO, 
requesting the money transfer (Wall Street Journal Catherine Stupp 2021). A similar 
attempt to defraud a US-based company failed in 2020 (Nisos 2021). Symantec has 
reported at least three cases in 2019 of synthetic audio being deployed to defraud 
private companies (The Washington Post Drew Harwell 2021). 


Deception Using GPT-3 
While user-generated manipulations aiming at impacting public opinion have avoided 
advanced technology so far, there was at least one attempt to use GPT-3 in gen- 
erating automated text, by an anonymous user/bot on Reddit (The Next Web 
Thomas Macaulay 2021). The effort seemed to abuse a third-party app access to 
OpenAI API and to give hundreds of comments. In some cases, the bot published 
conspiracy theories and made false statements. 

Additionally, recent reports noted GPT-3 has an “anti-Muslim” bias (The Next Web 
Tristan Greene 2021). This conclusion points to the potential damage that could be 
caused by apps that are using the technology, even with no malicious intentions. 


IO and Fraud Using AI Techniques 

Over a dozen cases of use of synthetic media for information operations and other 
criminal activities were recorded during the past couple of years. In December 2019, 
social media platforms noted the first use of AI techniques in a case of “coordinated 
inauthentic behavior” which was attributed to The BL, a company with ties to the 
publisher of The Epoch Times, according to Facebook. The company created hun- 
dreds of fake Facebook accounts using GAN techniques to create synthetic profile 
pictures. The use of GAN to create false portrait pictures of non-existent personas 
repeated itself in more than a handful of IO that were discovered during 2020, includ- 
ing in operations that were attributed by social media platforms to Russian IRA, to 
China, and to a US-based PR firm. 


2.6 Threat Models 


Just as the state of the art in technologies for media manipulation has advanced, so 
has theorizing around threat models. Conceptual models and frameworks used to 
understand and analyze threats and describe adversary behavior are evolving along- 
side what is known of the threat landscape. It is useful for students of media forensics 
to understand these frameworks generally, as they are used in the field by analysts, 
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Manipulating the narrative Manipulating the social network 
Positive Engage Messages that bringupa Back Actions that increase the 
related but relevant topic importance of the opinion leader 
or create a new opinion leader 
Explain Messages that provides Build Actions that create a group or 
details on or elaborate the appearance of a group 
the topic 
Excite messages that elicit a Bridge Actions that build a connection 
positive emotion such as between two or more groups 
joy or excitement 
Enhance Messages that encourage Boost Actions that grow the size of the 
the topic-group to group or make it appear that it 
continue with the topic has grown 
Negative Dismiss Messages about why the Neutralize Actions decrease the importance 
topic is not important of the opinion leader 
Distort Messages that alter the Nuke Actions that lead to a group 
main message of the being dismantled or breaking up, 
topic or appearing to be broken up 
Dismay Messages that elicit a Narrow Actions that lead to a group 
negative emotion such as becoming sequestered from 
sadness or anger other groups or marginalized 


Distract Discussion about a totally Neglect Actions that reduce the size of 


Fig. 2.4 BEND framework 


technologists, and others who work to understand and mitigate against manipulations 
and deception. 

It is useful to apply different types of frameworks and threat models for different 
kinds of threats. There is a continuum between the sub-disciplines and sub-concerns 
that form media and information manipulation. At times it is possible to seek to 
put too much under the same umbrella. For instance, when analyzing COVID- 
19 misinformation analysis, the processes and models that are used to tackle that 
are extremely different than the processes and models that we use to tackle covert 
influence operations that tend to be state-sponsored, or to understand attacks that 
are primarily financially motivated and conducted by criminal groups. It’s impor- 
tant to think about what models apply in what circumstances, and specifically what 
disciplines can contribute to which models. 

The below examples do not represent a comprehensive selection of threat models 
and frameworks, but serve as examples. 
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2.6.1 Carnegie Mellon BEND Framework 


Developed by Dr. Kathleen Carley, the BEND framework is a conceptual system that 
seeks to make sense of influence campaigns, and their strategies and tactics. Carley 
notes that the “BEND framework is the product of years of research on disinformation 
and other forms of communication based influence campaigns, and communication 
objectives of Russia and other adversarial communities, including terror groups such 
as ISIS that began in late December of 2013.”! The letters that are represented in the 
acronym BEND refer to elements of the framework, such as “B” which references 
“Back, Build, Bridge, and Boost” influence maneuvers. 


2.6.2 The ABC Framework 


Initially developed by Camille Francois, then Chief Innovation Officer at Graphika, 
the ABC framework offers a mechanism to understand “three key vectors character- 
istic of viral deception” in order to guide potential remedies (Francois 2020). In this 
framework: 


e Ais for Manipulative Actors. In Francois’s framework, manipulative actors engage 
knowingly and with clear intent in their deceptions. 

e B is for Deceptive Behavior. Behavior encompasses all of the techniques and tools 
a manipulative actor may use, from engaging a troll farm or a bot army to producing 
manipulated media. 

e Cis for Harmful Content. Ultimately, it is the message itself that defines a campaign 
and often its efficacy. 


TABLE 1 
The ABCDE Framework 
Actor What kinds of actors are involved? This question can help establish, for example, whether the case 
involves a foreign state actor. 
What activities are exhibited? This inquiry can help establish, for instance, evidence of coordination 
Behavior A 
and inauthenticity. 
C What kinds of content are being created and distributed? This line of questioning can help establish, for 
content š 
example, whether the information being deployed is deceptive 
Degree What is the overall impact of the case and whom does it affect? This question can help establish the actual 


harms and severity of the case. 


What is the overall impact of the case and whom does it affect? This question can help establish the actual 


ff 
Effect harms and severity of the case. 


Fig. 2.5 ABCDE framework 


l https://ink.springer.com/article/10.1007/s10588-020-09322-9. 
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Later, Alexandre Alaphilippe, Executive Director of the EU DisinfoLab, proposed 
adding a D for Distribution, to describe the importance of the means of distribution 
to the overall manipulation strategy (Brookings Inst Alexandre Alaphilippe 2021). 
And yet another letter joined the acronym courtesy of James Pamment in the fall 
of 2020-Pamment proposed E to evaluate the Effect of the manipulation campaign 
(Fig. 2.5). 


2.6.3 The AMITT Framework 


The Misinfosec and Credibility Coalition communities, led by Sara-Jayne Terp, 
Christopher Walker, John Gray, and Dr. Pablo Breuer, developed the AMITT (Adver- 
sarial Misinformation and Influence Tactics and Techniques) frameworks to codify 
the behaviours (tactics and techniques) of disinformation threats and responders, to 
enable rapid alerts, risk assessment, and analysis. AMITT proposes diagramming 
and treating equally both threat actor behaviors and response behaviors by the target 
(Gray and Terp 2019). The AMITT TTPs application of tactics, techniques, proce- 
dures, and frameworks is the disinformation version of the ATT&K model, which 
is an information security standard threat model. The AMITT STIX model is the 
disinformation version of the STIX model, which is an information security data 
sharing standard. 


2.6.4 The SCOTCH Framework 


In order to enable analysts “to comprehensively and rapidly characterize an adver- 
sarial operation or campaign,’ Dr. Sam Blazek developed the SCOTCH framework 
(Blazek 2021) (Fig. 2.6). 


e S—Source—the actor who is responsible for the campaign. 

e C—Channel—the platform and its affordances and features that deliver the cam- 
paign. 

O—Objective—the objective of the actor. 

T—Target—the intended audience for the campaign. 

C—Composition—the language, content, or message of the campaign. 
H—Hook—the psychological phenomena being exploited by the campaign. 


Blazek offers this example of how SCOTCH might describe a campaign: 


At the campaign level, a hypothetical example of a SCOTCH characterization is: Non-State 
Actors used Facebook, Twitter, and Low-Quality Media Outlets in an attempt to Undermine 
the Integrity of the 2020 Presidential Election among Conservative American Audiences 
using Fake Images and Videos and capturing attention by Hashtag Hijacking and Posting in 
Shared-Interest Facebook Groups via Cutouts (Fig. 2.7). 
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Fig. 2.6 AMITT framework 
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Fig. 2.7 Deception model effects 
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2.6.5 Deception Model Effects 


Still other frameworks are used to explain how disinformation diffuses in a popula- 
tion. Mathematicians have employed game theory to look at information as quantities 
that diffuse according to certain dynamics over networks. For instance, Carlo Kopp, 
Kevin Korb, and Bruce Mills model cooperation and diffusion in populations (Kopp 
et al. 2018) exposed to “fake news”: 

This model “depicts the relationships between the deception models and the com- 
ponents of system they are employed to compromise.” 


2.6.6 4Ds 


Some frameworks focus on the goals manipulators. Ben Nimmo, then at the Cen- 
tral European Policy Institute and now at Facebook, where he leads global threat 
intelligence and strategy against influence operations, developed the 4Ds to describe 
the tactics of Russian disinformation campaigns (StopFake.org 2021). “They can be 
summed up in four words: dismiss, distort, distract, dismay,’ wrote Nimmo (Fig. 2.8). 


Dismiss may involve denying allegations made against Russia or a Russian actor. 

Distortion includes outright lies, fabrications, or obfuscations of reality. 

e Distract includes methods of turning attention away from one phenomenon to 
another. 

e Dismay is a method to get others to regard Russia as too significant a threat to 

confront. 


2.6.7 Advanced Persistent Manipulators 


Clint Watts, a former FBI Counterterrorism official, similarly thinks about the goals 
of threat actors and describes, in particular, nation states and other well-resourced 
manipulators as Advanced Persistent Manipulators. 


They use combinations of influence techniques and operate across the full spectrum of 
social media platforms, using the unique attributes of each to achieve their objectives. They 
have sufficient resources and talent to sustain their campaigns, and the most sophisticated 
and troublesome ones can create or acquire the most sophisticated technology. APMs can 
harness, aggregate, and nimbly parse user data and can recon new adaptive techniques to 
skirt account and content controls. Finally, they know and operate within terms of service, 
and they take advantage of free-speech provisions. Adaptive APM behavior will therefore 
continue to challenge the ability of social media platforms to thwart corrosive manipulation 
without harming their own business model (Fig. 2.8). 
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Fig. 2.8 Advanced persistent manipulators framework 


2.6.8 Scenarios for Financial Harm 


Jon Bateman, a fellow in the Cyber Policy Initiative of the Technology and Interna- 
tional Affairs Program at the Carnegie Endowment for International Peace, shows 
how assessments of threats can be applied to specific industry areas or targets in his 
work on the threats of media manipulation to the financial system (Carnegie Endow- 
ment John Batemant 2020). 


In the absence of hard data, a close analysis of potential scenarios can help to better gauge 
the problem. In this paper, ten scenarios illustrate how criminals and other bad actors could 
abuse synthetic media technology to inflict financial harm on a broad swath of targets. Based 
on today’s synthetic media technology and the realities of financial crime, the scenarios 
explore whether and how synthetic media could alter the threat landscape. 


2.7 Investments in Countering False Media 


Just as there is substantial effort at theorizing and implementing threat models to 
understand media manipulation, there is a growing effort in civil society, government, 
and the private sector to contend with disinformation. There is a large volume of 
work in multiple disciplines to categorize both the threats and many key indicators 
of population susceptibility to disinformation, including new work on resilience, 
dynamics of exploitative activities, and case studies on incidents and actors, including 
those who are not primarily motivated by malicious intent. This literature combines 
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TABLE 1 
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Ten Synthetic Media Scenarios for Financial Harm 


Target 


Individuals 


Companies 


Markets 


Regulatory 
Structures 


Scenario Role of Synthetic Media 


Voice cloning or face-swap video is used to impersonate 
a wealthy individual and initiate fraudulent transactions. 
1. Identity theft Alternatively, it is used to impersonate a corporate 
officer and gain access to databases of personal 
information, which can enable larger-scale identity theft. 


Voice cloning or face-swap video is used to impersonate 
2. Imposter scam a trusted government official or family member of the 
victim and coerce a fraudulent payment. 


3. Cyber extortion Synthetic pornography of the victim is used for 


blackmail. 
4, Payment Voice cloning or face-swap video is used to impersonate 
fraud a corporate officer and initiate fraudulent transactions. 
5. Stock Voice cloning or face-swap video is used to defame a 


manipulation via corporate leader or falsify a product endorsement, which 
fabricated events can alter investor sentiment. 


Synthetic photos and text are used to construct human- 


° Beds like social media bots that attack or promote a brand, 
manipulation A f À 
Z which can alter investor perception of consumer 

via bots š 

sentiment. 

Synthetic photos and text are used to construct human- 
7. Malicious like social media bots that spread false rumors of bank 
bank run weakness, which can fuel runs 

on cash. 
8. Malicious Voice cloning or face-swap video is used to fabricate a 
flash crash market-moving event. 
9. Fabricated Voice cloning or face-swap video is used to fabricate 
government an imminent interest rate change, policy shift, or 
action enforcement action. 


Synthetic text is used to fabricate comments from the 


IO Regulatory public on proposed financial regulations, which can 


Key Malicious Technique 


©0000 8 Ə 


astroturfing manipulate the rulemaking process. 
© Deepfake @ Fabricated O Synthetic Narrowcast Broadcast 
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Fig. 2.9 Synthetic media scenarios and financial harm (Carnegie Endowment for Interna- 
tional Peace JON BATEMAN 2020) 


with a growing list of government, civil society and media and industry interventions 
aimed at mitigating the harms of disinformation. 

There are now hundreds of groups and organizations in the United States and 
around the world working on various approaches and problem sets to monitor and 
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counter disinformation in democracies, see Fig.2.10. This emerging infrastructure 
presents an opportunity to create connective tissue between these efforts, and to 
provide common tools and platforms that enable their work. Consider three examples: 


1. 


The Disinfo Defense League: A project of the Media Democracy Fund, the DDL 
is a consortium of more than 200 grassroots organizations across the U.S. that 
helps to mobilize minority communities to counter disinformation campaigns 
(ABC News Fergal Gallagher 2021). 

The Beacon Project: Established by the nonprofit IRI combats state-sponsored 
influence operations, in particular, ones from Russia. The project identifies and 
exposes disinformation and works with local partners to facilitate a response 
(Beacon Project 2021). 

CogSec Collab: The Cognitive Security Collaborative that maps and brings 
together information security researchers, data scientists, and other subject-matter 
experts to create and improve resources for the protection and defense of the cog- 
nitive domain (Cogsec Collab 2021). 


At the same time, there are significant efforts underway to contend with the impli- 


cations of these developments in synthetic media and the threat vectors they create. 


Map data from CogSe 


Disinformation Monitoring and Counter Groups E m Sn ee 


bites item anneni com/A 
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> 


Fig.2.10 There are now hundreds of civil society, government, and private sector groups addressing 
disinformation across the globe. (Cogsec Collab 2021) 
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2.7.1 DARPA SEMAFOR 


The Semantic Forensics (SemaFor) program under the Defense Advanced Research 
Projects Agency (DARPA) represents a substantial research and development effort 
to create forensic tools and techniques to develop semantic understanding of content 
and identify synthetic media across multiple modalities (DARPA 2021). 


2.7.2 The Partnership on AI Steering Committee on Media 
Integrity Working Group 


The Partnership on AI has hosted an industry led effort to create similar detectors 
(Humprecht et al. 2020). Technology platforms and universities continue to develop 
new tools and techniques to determine provenance and identify synthetic media and 
other forms of digital media of questionable veracity. The Working Group brings 
together organizations spanning civil society, technology companies, media organi- 
zations, and academic institutions will be focused on a specific set of activities and 
projects directed at strengthening the research landscape related to new technical 
capabilities in media production and detection, and increasing coordination across 
organizations implicated by these developments. 


2.7.3 JPEG Committee 


In October 2020, the JPEG committee released a draft document initiating a discus- 
sion to explore “if a JPEG standard can facilitate a secure and reliable annotation of 
media modifications, both in good faith and malicious usage scenarios” (Joint Bi level 
Image Experts Group and Joint Photographic Experts Group 2021). The committee 
noted the need to consider revision of standardization practices to address pressing 
challenges, including use of AI techniques (“Deep Fakes”) and use of authentic media 
out of context for the purposes of spreading misinformation and forgeries. The paper 
notes two areas of focus: (1) having a record of modifications of a file within its meta- 
data and (2) proving the record is secured and accessible so provenance is traceable. 


2.7.4 Content Authenticity Initiative (CAI) 


The CAI was founded in 2019 and led by Adobe, Twitter, and The New York Times, 
with the goal of establishing standards that will allow better evaluation of content. In 
August 2020, the forum published its white paper “Setting the Standard for Digital 
Content Attribution” highlighting the importance of transparency around provenance 
of content (Adobe, New York Times, Twitter, The Content Authenticity Initiative 
2021). CAI will “provide a layer of robust, tamper-evident attribution and history 
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data built upon XMP, Schema.org, and other metadata standards that goes far beyond 
common uses today.” 


2.7.5 Media Review 


Media Review is an initiative led by the Duke Reporters Lab to provide the fact 
checker community a schema that would allow evaluation of videos and images, 
through tagging of different types of manipulation (Schema.org 2021). Media Review 
is based on the ClaimReview project, and is currently a pending schema. 


2.8 Excerpts on Susceptibility and Resilience to Media 
Manipulation 


Like other topics in forensics, media manipulation is directly tied to psychology. 
Though a thorough treatment of this topic is outside the scope of this chapter, we 
would like to expose the reader to some foundational work in these areas of research. 

The following section is a collection of critical ideas from different areas of psy- 
chological mis/disinformation research. The purpose of these is to convey powerful 
ideas in the authors’ own words. We hope that readers will select those of interest 
and investigate those works further. 


2.8.1 Susceptibility and Resilience 


Below is a sample of recent studies on various aspects of exploitation via disinforma- 
tion which are summarized and quoted. We categorize these papers by overarching 
topics, but note that many bear relevant findings across multiple topics. 


Resilience to online disinformation: A framework for cross-national compara- 
tive research (Humprecht et al. 2020) 

This paper seeks to develop a cross-country framework for analyzing disinformation 
effects and population susceptibility and resilience. Three country clusters are iden- 
tified: media-supportive; politically consensus-driven (e.g., Canada, many Western 
European democracies); polarized (e.g., Italy, Spain, Greece); and low-trust, politi- 
cized, and fragmented (e.g., United States). Media-supportive countries in the first 
cluster are identified as most resilient to disinformation, and the United States is 
identified as most vulnerable to disinformation owing to its “high levels of populist 
communication, polarization, and low levels of trust in the news media.” 


Who falls for fake news? The roles of bullshit receptivity, overclaiming, famil- 
iarity, and analytic thinking (Pennycook and Rand 2020) 

The tendency to ascribe profundity to randomly generated sentences — pseudo- 
profound bullshit receptivity — correlates positively with perceptions of fake news 
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accuracy, and negatively with the ability to differentiate between fake and real news 
(media truth discernment). Relatedly, individuals who overclaim their level of knowl- 
edge also judge fake news to be more accurate. We also extend previous research 
indicating that analytic thinking correlates negatively with perceived accuracy by 
showing that this relationship is not moderated by the presence/absence of the head- 
line’s source (which has no effect on accuracy), or by familiarity with the headlines 
(which correlates positively with perceived accuracy of fake and real news). 


Fake news, fast and slow: Deliberation reduces belief in false (but not true) news 
headlines (Bago et al. 2020) 

The Motivated System 2 Reasoning (MS2R) account posits that deliberation causes 
people to fall for fake news, because reasoning facilitates identity-protective cogni- 
tion and is therefore used to rationalize content that is consistent with one’s political 
ideology. The classical account of reasoning instead posits that people ineffectively 
discern between true and false news headlines when they fail to deliberate (and instead 
rely on intuition)... Our data suggest that, in the context of fake news, deliberation 
facilitates accurate belief formation and not partisan bias. 


Political psychology in the digital (mis) information age: A model of news belief 
and sharing (Van Bavel et al. 2020) 

This paper reviews complex psychological risk factors associated with belief in mis- 
information such as partisan bias, polarization, political ideology, cognitive style, 
memory, morality, and emotion. The research analyzes the implications and risks 
of various solutions, such as fact-checking, “pre-bunking,” reflective thinking, de- 
platforming of bad actors, and media literacy. 


Misinformation and morality: encountering fake news headlines makes them 
seem less unethical to publish and share (Effron and Raj 2020) 

Experimental results and a pilot study indicate that “repeatedly encountering mis- 
information makes it seem less unethical to spread—-regardless of whether one 
believes it. Seeing a fake-news headline one or four times reduced how unethical 
participants thought it was to publish and share that headline when they saw it again 
—even when it was clearly labelled false and participants disbelieved it, and even after 
statistically accounting for judgments of how likeable and popular it was. In turn, 
perceiving it as less unethical predicted stronger inclinations to express approval of 
it online. People were also more likely to actually share repeated (vs. new) headlines 
in an experimental setting. We speculate that repeating blatant misinformation may 
reduce the moral condemnation it receives by making it feel intuitively true, and we 
discuss other potential mechanisms.” 


When is Disinformation (In) Credible? Experimental Findings on Message 
Characteristics and Individual Differences. (Schaewitz et al. 2020) 

“..we conducted an experiment (N = 294) to investigate effects of message factors 
and individual differences on individuals’ credibility and accuracy perceptions of 
disinformation as well as on their likelihood of sharing them. Results suggest that 
message factors, such as the source, inconsistencies, subjectivity, sensationalism, or 
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manipulated images, seem less important for users’ evaluations of disinformation 
articles than individual differences. While need for cognition seems most relevant 
for accuracy perceptions, the opinion toward the news topic seems most crucial for 
whether people believe in the news and share it online.” 


Long-term effectiveness of inoculation against misinformation: Three longitu- 
dinal experiments. (Maertens et al. 2020) 

“In 3 experiments (N = 151, N = 194, N = 170), participants played either Bad 
News (inoculation group) or Tetris (gamified control group) and rated the reliability 
of news headlines that either used a misinformation technique or not. We found 
that participants rate fake news as significantly less reliable after the intervention. In 
Experiment 1, we assessed participants at regular intervals to explore the longevity of 
this effect and found that the inoculation effect remains stable for at least 3 months. In 
Experiment 2, we sought to replicate these findings without regular testing and found 
significant decay over a 2-month time period so that the long-term inoculation effect 
was no longer significant. In Experiment 3, we replicated the inoculation effect and 
investigated whether long-term effects could be due to item-response memorization 
or the fake-to-real ratio of items presented, but found that this is not the case.” 


2.8.2 Case Studies: Threats and Actors 


Below one may find a collection of exposed current threats and actors utilizing 
manipulations of varying sophistication. 

The QAnon Conspiracy Theory: A Security Threat in the Making? (Combating 
Terrorism Center Amarnath Amarasingam, Marc-André Argentino 2021) 

Analysis of QAnon shows that, while disorganized, it is a significant public secu- 
rity threat due to historical violent events associated with the group, ease of access 
for recruitment, and a “crowdsourcing” effect wherein followers “take interpreta- 
tion and action into their own hands.” The origins and development of the group 
are analyzed, including review of adjacent organizations such as Omega Kingdom 
Ministries. Followers’ behavior is analyzed and five criminal cases are reviewed. 


Political Deepfake Videos Misinform the Public, But No More than Other Fake 
Media (Barari et al. 2021) 

“We demonstrate that political misinformation in the form of videos synthesized by 
deep learning (“deepfakes”) can convince the American public of scandals that never 
occurred at alarming rates — nearly 50% of a representative sample — but no more so 
than equivalent misinformation conveyed through existing news formats like textual 
headlines or audio recordings. Similarly, we confirm that motivated reasoning about 
the deepfake target’s identity (e.g., partisanship or gender) plays a key role in facili- 
tating persuasion, but, again, no more so than via existing news formats. ...Finally, a 
series of randomized interventions reveal that brief but specific informational treat- 
ments about deepfakes only sometimes attenuate deepfakes’ effects and in relatively 
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small scale. Above all else, broad literacy in politics and digital technology most 
strongly increases discernment between deepfakes and authentic videos of political 
elites.” 


Deepfakes and disinformation: exploring the impact of synthetic political video 
on deception, uncertainty, and trust in news (Vaccari and Chadwick 2020) 

An experiment using deepfakes to measure its value as a deception tool and its impact 
on public trust: “While we do not find evidence that deceptive political deepfakes 
misled our participants, they left many of them uncertain about the truthfulness of 
their content. And, in turn, we show that uncertainty of this kind results in lower 
levels of trust in news on social media.” 


Coordinating a Multi-Platform Disinformation Campaign: Internet Research 
Agency Activity on Three US Social Media Platforms, 2015-2017 (Lukito 2020) 
An analysis of the use of multiple modes as a technique to develop and deploy 
information attacks: [Abstract] “The following study explores IRA activity on three 
social media platforms, Facebook, Twitter, and Reddit, to understand how activities 
on these sites were temporally coordinated. Using a VAR analysis with Granger 
Causality tests, results show that IRA Reddit activity granger caused IRA Twitter 
activity within a one-week lag. One explanation may be that the Internet Research 
Agency is trial ballooning on one platform (i.e., Reddit) to figure out which messages 
are optimal to distribute on other social media (i.e., Twitter).” 


The disconcerting potential of online disinformation: Persuasive effects of astro- 
turfing comments and three strategies for inoculation against them (Zerback et al. 
2021) 

This paper examines online astroturfing as part of Russian electoral disinformation 
efforts, finding that online astroturfing was able to change opinions and increase 
uncertainty even when the audience was “inoculated” against disinformation prior 
to exposure. 


2.8.3 Dynamics of Exploitative Activities 


Below are some excerpts from literature covering the dynamics and kinetics of IO 
activities and their relationships to disinformation. 


Cultural Convergence: Insights into the behavior of misinformation networks 
on Twitter (McQuillan et al. 2020) 

“We use network mapping to detect accounts creating content surrounding COVID- 
19, then Latent Dirichlet Allocation to extract topics, and bridging centrality to iden- 
tify topical and non-topical bridges, before examining the distribution of each topic 
and bridge over time and applying Jensen-Shannon divergence of topic distributions 
to show communities that are converging in their topical narratives.” The primary 
topic under discussion within the data was COVID-19, and this report includes a 
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detailed analysis of “cultural bridging” activities among/between communities with 
conspiratorial views of the pandemic. 


An automated pipeline for the discovery of conspiracy and conspiracy the- 
ory narrative frameworks: Bridgegate, Pizzagate and storytelling on the web 
(Tangherlini et al. 2020) 

“Predicating our work on narrative theory, we present an automated pipeline for 
the discovery and description of the generative narrative frameworks of conspiracy 
theories that circulate on social media, and actual conspiracies reported in the news 
media. We base this work on two separate comprehensive repositories of blog posts 
and news articles describing the well-known conspiracy theory Pizzagate from 2016, 
and the New Jersey political conspiracy Bridgegate from 2013. ...We show how the 
Pizzagate framework relies on the conspiracy theorists’ interpretation of “hidden 
knowledge” to link otherwise unlinked domains of human interaction, and hypothe- 
size that this multi-domain focus is an important feature of conspiracy theories. We 
contrast this to the single domain focus of an actual conspiracy. While Pizzagate 
relies on the alignment of multiple domains, Bridgegate remains firmly rooted in the 
single domain of New Jersey politics. We hypothesize that the narrative framework 
of a conspiracy theory might stabilize quickly in contrast to the narrative framework 
of an actual conspiracy, which might develop more slowly as revelations come to 
light.” 

Fake news early detection: A theory-driven model (Zhou et al. 2020) 

“The method investigates news content at various levels: lexicon-level, syntax-level, 
semantic-level, and discourse-level. We represent news at each level, relying on well- 
established theories in social and forensic psychology. Fake news detection is then 
conducted within a supervised machine learning framework. As an interdisciplinary 
research, our work explores potential fake news patterns, enhances the interpretability 
in fake news feature engineering, and studies the relationships among fake news, 
deception/disinformation, and clickbaits. Experiments conducted on two real-world 
datasets indicate the proposed method can outperform the state-of-the-art and enable 
fake news early detection when there is limited content information” 

“...Among content-based models, we observe that (3) the proposed model performs 
comparatively well in predicting fake news with limited prior knowledge. We also 
observe that (4) similar to deception, fake news differs in content style, quality, and 
sentiment from the truth, while it carries similar levels of cognitive and percep- 
tual information compared to the truth. (5) Similar to clickbaits, fake news headlines 
present higher sensationalism and lower newsworthiness while their readability char- 
acteristics are complex and difficult to be directly concluded. In addition, fake news 
(6) is often matched with shorter words and longer sentences.” 


A survey of fake news: Fundamental theories, detection methods, and opportu- 
nities (Zhou and Zafarani 2020) 

“By involving dissemination information (i.e., social context) in fake news detection, 
propagation-based methods are more robust against writing style manipulation by 
malicious entities. However, propagation-based fake news detection is inefficient for 
fake news early detection ... as it is difficult for propagation-based models to detect 
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fake news before it has been disseminated, or to perform well when limited news 
dissemination information is available. Furthermore, mining news propagation and 
news writing style allow one to assess news intention. As discussed, the intuition 
is that (1) news created with a malicious intent, that is, to mislead and deceive the 
public, aims to be “more persuasive” compared to those not having such aims, and 
(2) malicious users often play a part in the propagation of fake news to enhance its 
social influence.” This paper also emphasizes the need for explainable detection as 
a matter of public interest. 


A Signal Detection Approach to Understanding the Identification of Fake News 
(Batailler et al. 2020) 

“The current article discusses the value of Signal Detection Theory (SDT) in disen- 
tangling two distinct aspects in the identification of fake news: (1) ability to accurately 
distinguish between real news and fake news and (2) response biases to judge news 
as real versus fake regardless of news veracity. The value of SDT for understanding 
the determinants of fake news beliefs is illustrated with reanalyses of existing data 
sets, providing more nuanced insights into how partisan bias, cognitive reflection, 
and prior exposure influence the identification of fake news.” 

Sharing of fake news on social media: Application of the honeycomb framework 
and the third-person effect hypothesis (Talwar et al. 2020) 

This publication uses the “honeycomb framework” of Kietzmann, Hermkens, 
McCarthy, and Silvestre (Kietzmann et al. 2011) to explain sharing behaviors related 
to fake news. Qualitative evaluation supports the explanatory value of the honeycomb 
framework, and the authors are able to identify a number of quantitative associations 
between demographics and sharing characteristics. 


Modeling echo chambers and polarization dynamics in social networks 
(Baumann et al. 2020) 

“We propose a model that introduces the dynamics of radicalization, as a reinforcing 
mechanism driving the evolution to extreme opinions from moderate initial condi- 
tions. Inspired by empirical findings on social interaction dynamics, we consider 
agents characterized by heterogeneous activities and homophily. We show that the 
transition between a global consensus and emerging radicalized states is mostly 
governed by social influence and by the controversialness of the topic discussed. 
Compared with empirical data of polarized debates on Twitter, the model qualita- 
tively reproduces the observed relation between users’ engagement and opinions, as 
well as opinion segregation in the interaction network. Our findings shed light on 
the mechanisms that may lie at the core of the emergence of echo chambers and 
polarization in social media.” 


(Mis)representing Ideology on Twitter: How Social Influence Shapes Online 
Political Expression (Guess 2021) 

In comparing users’ tweets as well as those of the accounts that they follow, alongside 
self-reported political affiliations from surveys, a distinction between self-reported 
political slant and those views expressed in tweets. This raises the notion that public 
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political expression may be distinct from individual affiliations, and that an indi- 
vidual's tweets may not be an accurate proxy for their beliefs. The authors call 
for a re-examination of potential social causes of/influences upon public political 
expression: [from Abstract] “we find evidence consistent with the hypothesis that 
users’ public expression is powerfully shaped by their followers, independent of the 
political ideology they report identifying with in attitudinal surveys. Finally, we find 
that users’ ideological expression is more extreme when they perceive their Twitter 
networks to be relatively like-minded.” 


2.8.4 Meta-Review 


A modern review on the literature surrounding disinformation may be found below. 


A systematic literature review on disinformation: Toward a unified taxonomical 
framework (Kapantai et al. 2021) 

“Our online information landscape is characterized by a variety of different types of 
false information. There is no commonly agreed typology framework, specific cate- 
gorization criteria, and explicit definitions as a basis to assist the further investigation 
of the area. Our work is focused on filling this need. Our contribution is twofold. 
First, we collect the various implicit and explicit disinformation typologies proposed 
by scholars. We consolidate the findings following certain design principles to articu- 
late an all-inclusive disinformation typology. Second, we propose three independent 
dimensions with controlled values per dimension as categorization criteria for all 
types of disinformation. The taxonomy can promote and support further multidisci- 
plinary research to analyze the special characteristics of the identified disinformation 


types.” 


2.9 Conclusion 


While media forensics as a field is generally occupied by engineers and computer 
scientists, it intersects with a variety of other disciplines, and its importance is crucial 
to the ability for people to form consensus and collaboratively solve problems, from 
climate change to global health. 

John Locke, an English philosopher and influential thinker on empiricism and the 
Enlightenment, once wrote that “the improvement of understanding is for two ends: 
first, our own increase of knowledge; secondly, to enable us to deliver that knowledge 
to others.” Indeed, your study of media forensics takes Locke’s formulation a step 
further. The effort you put in to understanding this field will not only increase your 
own knowledge, it will allow you to build an information ecosystem that delivers 
that knowledge to others—the knowledge necessary to fuel human discourse. 

The work is urgent. 
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Chapter 3 A) 
Computational Imaging TAE 


Scott McCloskey 


Since the advent of smartphones, photography is increasingly being done with small, 
portable, multi-function devices. Relative to the purpose-built cameras that domi- 
nated previous eras, smartphone cameras must overcome challenges related to their 
small form factor. Smartphone cameras have small apertures that produce a wide 
depth of field, small sensors with rolling shutters that lead to motion artifacts, and 
small form factors which lead to more camera shake during exposure. Along with 
these challenges, smartphone cameras have the advantage of tight integration with 
additional sensors and the availability of significant computational resources. For 
these reasons, the field of computational imaging has advanced significantly in recent 
years, with academic groups and researchers from smartphone manufacturers helping 
these devices become more capable replacements for purpose-built cameras. 


3.1 Introduction to Computational Imaging 


Computational imaging (or computational photographic) approaches are character- 
ized by the co-optimization of what and how the sensor captures light and how 
that signal is processed in software. Computational imaging approaches now com- 
monly available on most smartphones include panoramic stitching, multi-frame high 
dynamic range (HDR) imaging, ‘portrait mode’ for low depth of field, and multi- 
frame low-light imaging (‘night sight’ or ‘night mode’). 

As image quality is viewed as a differentiating feature between competing smart- 
phone brands, there has been tremendous progress improving subjective image qual- 
ity, accompanied by a lack of transparency due to the proprietary nature of the work. 
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Like the smartphone market more generally, computational imaging approaches used 
therein change very quickly from one generation of the phone to the next. This com- 
bination of different modes, and the proprietary and rapid nature of changes, all pose 
challenges for forensics practitioners. 

This chapter investigates some of the forensics challenges that arise from the 
increased adoption of computational imaging and assesses our ability to detect auto- 
mated focus manipulations performed by computational cameras. We then look more 
generally at a cue to help distinguish optical blur from synthetic blur. The chapter 
concludes with a look at some early computational imaging research that may impact 
forensics in the near future. 

This chapter will not focus on the definitional and policy issues related to compu- 
tational imagery, believing that these issues are best addressed within the context of 
a specific forensic application. The use of an automated, aesthetically driven compu- 
tational imaging mode does not necessarily imply nefarious intent on the part of the 
photographer in all cases, but may in some. This suggests that broad-scale screening 
for manipulated imagery (e.g., on social media platforms) might not target portrait 
mode images for detection and further scrutiny, but in the context of an insurance 
claim or a court proceeding, higher scrutiny may be necessary. 

As with the increase in the number, complexity, and ease of use of software pack- 
ages for image manipulation, a key forensic challenge presented by computational 
imaging is the degree to which they democratize the creation of manipulated media. 
Prior to these developments, the creation of a convincing manipulation required a 
knowledgeable user dedicating a significant amount of time and effort with only 
partially automated tools. Much like Kodak cameras greatly simplified consumer 
photography with the motto “You Press the Button, We Do the Rest,’ computational 
cameras allow users to create—with the push of a button—imagery that’s inconsis- 
tent with the classically understood physical limitations of the camera. Realizing that 
most people won’t carry around a heavy camera with a large lens, a common goal 
of computational photography is to replicate the aesthetics of Digital Single Lens 
Reflex (DSLR) imagery using the much smaller and lighter sensors and optics used 
in mobile phones. From this, one of the significant achievements of computational 
imaging is the ability to replicate the shallow depth of field of a large aperture DSLR 
lens on a smartphone with a tiny aperture. 


3.2 Automation of Geometrically Correct Synthetic Blur 


Optical blur is a perceptual cue to depth (Pentland 1987), a limiting factor in the 
performance of computer vision systems (Bourlai 2016), and an aesthetic tool that 
photographers use to separate foreground and background parts of a scene. Because 
smartphone sensors and apertures are so small, images taken by mobile phones 
often appear to be in focus everywhere, including background elements that distract 
a viewer’s attention. To draw viewers’ perceptual attention to a particular object 
and avoid distractions from background objects, smartphones now include ‘portrait 
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Fig. 3.1 (Left) For a close-in scene captured by a smartphone with a small aperture, all parts of 
the image will appear in focus. (Right) In order to mimic the aesthetically pleasing properties of 
a DSLR with a wide aperture, computational cameras now include a “portrait mode” that blurs the 
background so the foreground stands out better 


mode’ which automatically manipulates the sensor image via the application of spa- 
tially varying blur. Figure 3.1 shows an example of the all-in-focus natively captured 
by the sensor and the resulting portrait mode image. Because there is a geometric 
relationship between optical blur and depth in the scene, the creation of geometrically 
correct synthetic blur requires knowledge of the 3D scene structure. Consistent with 
the spirit of computational cameras, portrait modes enable this with a combination 
of hardware and software. Hardware acquires the scene in 3D using either stereo 
cameras or sensors with the angular sensitivity of a light field camera (Ng 2006). 
Image processing software uses this 3D information to infer and apply the correct 
amount of blur for each part of the scene. The result is automated, geometrically 
correct optical blur that looks very convincing. A natural question, then, is whether 
or not we can accurately differentiate “portrait mode’-based imagery from genuine 
optical blur. 


3.2.1 Primary Cue: Image Noise 


Regardless of how the local blurring is implemented, the key difference between 
optical blur and portrait mode-type processing can be found in image noise. When 
blur happens optically, before photons reach the sensor, only small signal-dependent 
noise impacts are observed. When blur is applied algorithmically to an already digi- 
tized image, however, the smoothing or filtering operation also implicitly de-noises 
the image. Since the amount of de-noising is proportional to the amount of local 
smoothing or blurring, differences in the amount of algorithmic local blur can be 
detected via inconsistencies between the local intensity and noise level. Two regions 
of the image having approximately the same intensity should also have approxi- 
mately the same level of noise. If one region is blurred more than the other, or one is 


44 S. McCloskey 


blurred while the other is not, an inconsistency is introduced between the intensities 
and local noise levels. 

For our noise analysis, we extend the combined noise models of Tsin et al. (2001) 
and Liu et al. (2006). Ideally, a pixel produces a number of electrons Enum propor- 
tional to the average irradiance from the object being imaged. However, shot noise 
Ns is a result of the quantum nature of light and captures the uncertainty in the 
number of electrons stored at a collection site; Ns can be modeled as the Poisson 
noise. Additionally, site-to-site non-uniformities called fixed pattern noise K are a 
multiplicative factor impacting the number of electrons; K can be characterized as 
having mean | and a small spatial variance ox over all of the collection sites. Ther- 
mal energy in silicon generates free electrons which contribute dark current to the 
image; this is modeled as an additive factor Npc, modeled as the Gaussian noise. 
The on-chip output amplifier sequentially transforms the charge collected at each 
site into a measurable voltage with a scale A, and the amplifier generates zero mean 
read-out noise Nr with variance Ons Demosaicing is applied in color cameras to 
interpolate two of the three colors at each pixel, and it introduces an error that is 
sometimes modeled as noise. After this, the camera response function (CRF) /(-) 
maps this voltage via a non-linear transform to improve perceptual image quality. 
Lastly, the analog-to-digital converter (ADC) approximates the analog voltage as an 
integer multiple of a quantization step g. The quantization noise can be modeled as 
the addition of a noise source No. 

With these noise sources in mind, we can describe a digitized 2D image as follows: 


D(x, y) = f ((K Œ, Y) Enum, y) + Noc, y) + Ns(z, y) + Nae, y) A) 
+ No(x, y) (3.1) 


The variance of the noise is given by 


2 
q 
rr, y= £2(A(K (x, Y) Enum, y) + E[Npc (x, y)] + o2)) =F 12 (3.2) 
where E[-] is the expectation function. This equation tells us two things which are 
typically overlooked in the more simplistic model of noise as an additive Gaussian 
source: 


1. The noise variance’s relationship with intensity reveals the shape of the CRF’s 
derivative f’. 
2. Noise has a signal-dependent aspect to it, as evidenced by the Enum term. 


An important corollary to this is that different levels of noise in regions of an image 
having different intensities is not per se an indicator of manipulation, though it has 
been taken as one in past work (Mahdian and Saic 2009). We show in our experiments 
that, while the noise inconsistency cue from Mahdian and Saic (2009) has some pre- 
dictive power in detecting manipulations, a proper accounting for signal-dependent 
noise via its relationship with image intensity significantly improves accuracy. 
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Measuring noise in an image is, of course, ill-posed and is equivalent to the 
long-standing image de-noising problem. For this reason, we leverage three dif- 
ferent approximations of local noise, measured over approximately uniform image 
regions: intensity variance, intensity gradient magnitude, and the noise feature of 
Mahdian and Saic (2009) (abbreviated NOI). Each of these is related to the image 
intensity of the corresponding region via a 2D histogram. This step translates subtle 
statistical relationships in the image to shape features in the 2D histograms which 
can be classified by a neural network. As we show in the experiments, our detec- 
tion performance on histogram features significantly improves on that of popular 
approaches applied directly to the pixels of the image. 


3.2.2 Additional Photo Forensic Cues 


One of the key challenges in forensic analysis of images ‘in the wild’ is that compres- 
sion and other post-processing may overwhelm subtle forgery cues. Indeed, noise 
features are inherently sensitive to compression which, like blur, smooths the image. 
In order to improve detection performance in such challenging cases, we incorpo- 
rate additional forensic cues which improve our method’s robustness. Some portrait 
mode implementations appear to operate on a JPEG image as input, meaning that the 
outputs exhibit cues related to double JPEG compression. As such, there is a range 
of different cues that can reveal manipulations in a subset of the data. 


3.2.2.1 Demosaicing Artifacts 


Forensic researchers have shown that the differences between the demosaicing algo- 
rithm and the differences between the physical color filter array bonded to the sensor 
can be detected from the image. Since focus manipulations are applied on the demo- 
saiced images, the local smoothing operations will alter these subtle Color Filter 
Array (CFA) demosaicing artifacts. In particular, the lack of CFA artifacts or the 
detection of weak, spatially varying CFA artifacts indicates the presence of global 
or local tampering, respectively. 

Following the method of Dirik and Memon (2009), we consider the demosaicing 
scheme f4 being a bilinear interpolation. We divide the image into W x W sub- 
blocks and only compute the demosaicing feature at the non-smooth blocks of pixels. 
Denote each non-smooth block as B;, where i = 1, ..., mg, and mpg is the number 
of non-smooth blocks in the image. The re-interpolation error of i-th sub-block for 
the k-th CFA pattern 0, is defined as Bix = fa(B,, 0.) and k = 1,...,4. The MSE 
error matrix EP (k,c),c € R, G, B between the blocks B and B is computed in 
non-smooth regions all over the image. Therefore, we define the metric to estimate 
the uniformity of normalized green channel column vector as 
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3.2.2.2 JPEG Artifact 


In some portrait mode implementations, such as the iPhone, the option to save both 
an original and a portrait mode image of the same scene suggests that post-processing 
is applied after JPEG compression. Importantly, both the original JPEG image and 
the processed version are saved in the JPEG format without resizing. Hence, Discrete 
Cosine Transform (DCT) coefficients representing un-modified areas will undergo 
two consecutive JPEG compressions and exhibit double quantization (DQ) artifacts, 
used extensively in the forensics literature. DCT coefficients of locally blurred areas, 
on the other hand, will result from non-consecutive compressions and will present 
weaker artifacts. 

We follow the work of Bianchi et al. (2011) and use the Bayesian inference to 
assign to each DCT coefficient a probability of being doubly quantized. Accumulated 
over each 8 x 8 block of pixels, the DQ probability map allows us to distinguish 
original areas (having high DQ probability) from tampered areas (having low DQ 
probability). The probability of a block being tampered can be estimated as 


p= u/( I] (R(m;) — L(mi)) * kg(mj) + 1) 
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where m is the value of the DCT coefficient; k,(-) is a Gaussian kernel with standard 
deviation o,/ Q2; Qı and Q> are the quantization steps used in the first and second 
compression, respectively; b is the bias; and u is the unquantized DC coefficient. 
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3.2.3 Focus Manipulation Detection 


To summarize the analysis above, we adopt five types of features: color variance 
(VAR), image gradient (GRAD), double quantization (ADQ) (Bianchi et al. 2011), 
color filter artifacts (CFA) (Dirik and Memon 2009), and noise inconsistencies (NOD) 
(Mahdian and Saic 2009) for refocusing detection. Each of these features is computed 
densely at each location in the image, and Fig. 3.2 illustrates the magnitude of these 
features in a feature map for an authentic image (top row) and a portrait mode image 
(middle row). Though there are notable differences between the feature maps in these 
two rows, there is no clear indication of a manipulation except, perhaps, the ADQ 
feature. And, as mentioned above, the ADQ cue is fragile because it depends on 
whether blurring is applied after an initial compression. 

As mentioned in Sect. 3.2.1, the noise cues are signal-dependent in the sense that 
blurring introduces an inconsistency between intensity and noise levels. To illustrate 
this, Fig. 3.2’s third row shows scatter plots of the relationship between intensity (on 
the horizontal axis) and the various features (on the vertical axis). In these plots, par- 
ticularly the columns related to noise (Variance, Gradient, and NOD), the distinction 
between the statistics of the authentic image (blue symbols) and the manipulated 
image (red symbols) becomes quite clear. Noise is reduced in most of the images, 
though the un-modified foreground region (the red bowl) maintains relatively higher 
noise because it is not blurred. Note also that the noise levels across the manipulated 
image are actually more consistent than in the authentic image, showing that previous 
noise-based forensics (Mahdian and Saic 2009) are ineffective. 


Variance Gradient ADQ 


. ; Manipulated _ Authentic 


Fig. 3.2 Feature maps and histogram for authentic and manipulated images. On the first row are the 
authentic image feature maps; the second row shows the corresponding maps for the manipulated 
image. We show scatter plots relating the features to intensity in the third row, where blue sample 
points correspond to the authentic image, and red corresponds to a manipulated DoF image (which 
was taken with an iPhone) 
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Fig. 3.3 Manipulated refocusing image detection pipeline. The example shown is an iPhone7plus 
portrait mode image 


Figure 3.3 shows our portrait mode detection pipeline, which incorporates these 
five features. In order to capture the relationship between individual features and 
the underlying image intensity, we employ an intensity versus feature bivariate 
histogram—which we call the focus manipulation inconsistency histogram (FMIH). 
We use FMIH for all five features for defocus forgery image detection, each of which 
is analyzed by a neural network called FMIHNet. These five classification results 
are combined by a majority voting scheme to determine a final classification label. 

After densely computing the VAR, GRAD, ADQ, CFA, and NOI features for 
each input image (shown in the first five columns of the second row of Fig.3.3), we 
partition the input image into superpixels and, for each superpixel isp, we compute 
the mean F (isp) of each feature measure and its mean intensity. Finally, we generate 
the FMIH for each of the five figures, shown in the five columns of the third row of 
Fig.3.3. Note that the FMIH are flipped vertically with respect to the scatter plots 
shown in Fig. 3.2. A comparison of the FMIH extracted features from the same scene 
captured with different cameras is shown in Fig. 3.4. 
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Fig. 3.4 Extracted FMIHs for the five feature measures with images captured using Canon60D, 
iPhone7Plus, and HuaweiMate9 cameras 


3.2.3.1 Network Architectures 


We have designed a FMIHNet, illustrated in Fig.3.5, for the five histogram features. 
Our network is a VGG (Simonyan and Zisserman 2014) style network consisting of 
convolutional (CONV) layers with small receptive fields (3 x 3). During training, 
the input to our FMIHNet is a fixed-size 101 x 202 FMIH. The FMIHNet is a fusion 
of two relatively deep sub-networks: FMIHNet1 with 20 CONV layers for VAR 
and CFA features, and FMIHNet2 with 30 CONV layers for GRAD, ADQ, and 
NOI features. The CONV stride is fixed to 1 pixel; the spatial padding of the input 
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Fig. 3.5 Network architecture: FMIHNet! for Var and CFA features; FMIHNet2 for Grad, ADQ, 
and NOI features 


FMIHNet1 


EMIHNet2 
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features is set to 24 pixels to preserve the spatial resolution. Spatial pooling is carried 
out by five max-pooling layers, performed over a 2 x 2 pixel window, with stride 
2. A stack of CONV layers followed by one Fully Connected (FC) layer performs 
two-way classification. The final layer is the soft-max layer. All hidden layers have 
rectification (ReLU) non-linearity. 

There are two reasons for the small 3 x 3 receptive fields: first, incorporating 
multiple non-linear rectification layers instead of a single one makes the decision 
function more discriminative; secondly, this reduces the number of parameters. This 
can be seen as imposing a regularization on a later CONV layer by forcing it to have 
a decomposition through the 3 x 3 filters. 

Because most of the values in our FMIH are zeros (i.e., most cells in the 2D 
histogram are empty) and because we only have two output classes (authentic and 
portrait mode), more FC layers seem to degrade the training performance. 


3.2.4 Portrait Mode Detection Experiments 


Having introduced a new method to detect focus manipulations, we first demonstrate 
that our method can accurately identify manipulated images even if they are geomet- 
rically correct. Here, we also show that our method is more accurate than both past 
forensic methods (Bianchi et al. 2011; Dirik and Memon 2009; Mahdian and Saic 
2009) and the modern vision baseline of CNN classification applied directly to the 
image pixels. Second, having claimed that the photometric relationship of noise cues 
with the image intensity is important, we will show that our FMIH histograms are a 
more useful representation of these cues. 

To demonstrate our performance on the hard cases of geometrically correct focus 
manipulations, we have built a focus manipulation dataset (FMD) of images captured 
with a Canon 60D DSLR and two smartphones having dual lens camera-enabled 
portrait modes: the iPhone7Plus and the Huawei Mate9. Images from the DSLR 
represent real shallow DoF images, having been taken with focal lengths in the range 
17-70 and f numbers in the range F/2.8-F/5.6. The iPhone was used to capture aligned 
pairs of authentic and manipulated images using portrait mode. The Mate9 was also 
used to capture authentic/manipulated image pairs, but these are only approximately 
aligned due to its inability to save the image both before and after portrait mode 
editing. 

We use 1320 such images for training and 840 images for testing. The training 
set consists of 660 authentic images (220 from each of the three cameras) and 660 
manipulated images (330 from each of iPhone7Plus and Huawei Mate9). The test 
set consists of 420 authentic images (140 from each of the three cameras) and 420 
manipulated images (140 from each of iPhone7Plus and HuaweiMate9). 

Figure 3.4 shows five sample images from FMD and illustrates the perceptual real- 
ism of the manipulated images. The first row of Table 3.1 quantifies this performance 
and shows that a range of CNN models (AlexNet, CaffeNet, VGG16, and VGG19) 
have classification accuracies in the range of 76-78%. Since our method uses five 
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Table 3.1 Image classification accuracy on FMD 


Data AlexNet CaffeNet VGG16 VGG19 
Image 0.784 
VAR map 0.726 
GRAD map 0.769 
ADQ map 0.740 
CFA map 0.785 
NOI map 0.707 0.765 0.745 0.760 


Table 3.2 FMIH classification accuracy on FMD 
Data SVM AlexNet | CaffeNet VGG19 LeNet FMIHNet 
VAR 0.829 0.480 0.480 0.512 0.635 0.850 
GRAD 0.909 0.503 0.500 0.486 0.846 0.954 
ADQ 0.909 0.496 0.520 0.511 0.844 0.946 


CFA 0.882 0.510 0.520 0.510 0.871 0.919 
NOI 0.858 0.497 0.506 0.530 0.779 0.967 
Vote 0.942 — — — — 0.888 0.982 


different feature maps which can easily be interpreted as images, the remaining rows 
of Table 3.1 show the classification accuracies of the same CNN models applied 
to these feature maps. The accuracies are slightly lower than for the image-based 
classification. 

In Sect. 3.2.1, we claimed that a proper accounting for signal-dependent noise via 
our FMIH histograms improves upon the performance of the underlying features. 
This is demonstrated by comparing the image- and feature map-based classification 
performance of Table 3.1 with the FMIH-based classification performance shown in 
Table 3.2. Using FMIH, even the relatively simple SVMs and LeNet CNNs deliver 
classification accuracies in the 80-90% range. Our FMIHNet architecture produces 
significantly better results than these, with our method’s voting output having a 
classification accuracy of 98%. 


3.2.5 Conclusions on Detecting Geometrically Correct 
Synthetic Blur 


We have presented a novel framework to detect focus manipulations, which represent 
an increasingly difficult and important forensics challenge in light of the availabil- 
ity of new camera hardware. Our approach exploits photometric histogram features, 
with a particular emphasis on noise, whose shapes are altered by the manipula- 
tion process. We have adopted a deep learning approach that classifies these 2D 
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histograms separately and then votes for a final classification. To evaluate this, we 
have produced a new focus manipulation dataset with images from a Canon60D 
DSLR, iPhone7Plus, and HuaweiMate9. This dataset includes manipulations, par- 
ticularly from the iPhone portrait mode, that are geometrically correct due to the 
use of dual lens capture devices. Despite the challenge of detecting manipulations 
that are geometrically correct, our method’s accuracy is 98%, significantly better 
than image-based detection with a range of CNNs, and better than prior forensics 
methods. 

While these experiments make it clear that the detection of imagery generated by 
first-generation portrait modes is reliable, it’s less clear that this detection capability 
will hold up as algorithmic and hardware improvements are made to mobile phones. 
This raises the question of how detectable digital blurring (i.e., blurring performed 
in software) is compared to optical blur. 


3.33 Differences Between Optical and Digital Blur 


As mentioned in the noise analysis earlier in this chapter, the shape of the Camera 
Response Function (CRF) shows up in image statistics. The non-linearity of the CRF 
impacts more than just noise and, in particular, its effect is now well-understood on 
motion deblurring (Chen et al. 2012; Tai et al. 2013). While that work showed how 
an unknown CRF is a noise source in blur estimation, we will demonstrate that it is 
a key signal helping to distinguish authentic edges from software-blurred gradients, 
particularly at splicing boundaries. This allows us to identify forgeries that involve 
artificially blurred edges, in addition to ones with artificially sharp transitions. 

We first consider blurred edges: for images captured by a real camera, the blur 
is applied to scene irradiance as the sensor is exposed, and it is then mapped to 
image intensity via the CRF. There are two operations involved, CRF mapping and 
optical/manual blur convolution. By contrast, in a forged image, the irradiance of 
a sharp edge is mapped to intensity first and then blurred by the manual blur point 
spread function (PSF). The key to distinguishing these is that the CRF is a non-linear 
operator, so it is not commutative, i.e., applying the PSF before the CRF leads to a 
different result than applying the CRF first, even if the underlying signal is the same. 

The key difference is that the profile of an authentic blurred edge (CRF before 
PSF) is asymmetric w.r.t the center location of an edge, whereas an artificially blurred 
edge is symmetric, as illustrated in Fig. 3.6. This is because, due to its non-linearity, 
the slope of the CRF is different across the range of intensities. CRFs typically have a 
larger gradient in low intensities, small gradient in high intensities, and approximately 
constant slope in the mid-tones. In order to capture this, we use the statistical bivariate 
histogram of pixel intensity versus gradient magnitude, which we call the intensity 
gradient histogram (IGH), and use it as the feature for synthetic blur detection around 
edges in an image. The IGH captures key information about the CRF without having 
to explicitly estimate it and, as we show later, its shape differs between natural and 
artificial edge profiles in a way that can be detected with existing CNN architectures. 
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Fig. 3.6 Authentic blur (blue), authentic sharp (cyan), forgery blur (red) and forgery sharp 
(magenta) edge profiles (left), gradients (center), and IGHs (right) 


Before starting the theoretical analysis, we clarify our notation. For simplicity, we 
study the role of the operations ordering on a 1D edge. We denote the CRF, which is 
assumed to be a non-linear monotonically increasing function, as f (r), normalized 
to satisfy f (0) = 0, f(1) = 1. And the inverse CRF is denoted as g(R) = f (R), 
satisfying g(0) = 0, f (g(1)) = 1.r represents irradiance, and R intensity. We assume 
that the optical blur PSF is a Gaussian function, having confirmed that the blur tools 
in popular image manipulation software packages use a Gaussian kernel, 


1 Ean 
e -2° (3.5) 


K(x) = 
i OoN 2T 


When the edge in irradiance space r is a step edge, like the one shown in green in 
Fig. 3.6, 
a x<c 


H,, p (x) = f eae (3.6) 


where x is the pixel location. If we use a unit step function, 


0 «<0 
= 3:7. 
u(x) |: n (3.7) 
The step edge can be represented by 
H; PF (x) = (b — a)u(x — c) +a (3.8) 


3.3.1 Authentically Blurred Edges 


An authentically blurred (ab) edge is 


la = f (Kg) * A ytep (X)) (3.9) 
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where * represents convolution. The gradient of this 


d[K,(x) * step (x)] 

dx 

d step (x) 
dx 


V Ia = f (KŒ) * H,, p (X)) Í 
= f'(K,(x) * Hstep(x)) I K,(x) * 


Because the differential of the step edge is a delta function, 


d Astep (x) 


= (b — a)ó(x — c) (3.10) 
dx 


We have 


Viab = f'(Kg(x) * Hstep(x)) i K,(x) * (b — a)ó(x — c) 
= (b—- a) f'(K,(x) * A ytep(X)) I K,(x —c) 


Substituting (3.9) into above equation, we have the relationship between Jap and 
gradient V Igy 


Vla = (b — a) f'(f~" ab) » K.G — c) (3.11) 


And K,(x — c) is just shifting the blur kernel to the location of the step. Because f 
is non-linear and its gradient is large at lower irradiance and small at higher irradiance, 
f’ is asymmetric. Therefore, IGH of authentically blurred edge is asymmetric, as 
shown in blue in Fig. 3.6. 


3.3.2 Authentic Sharp Edge 


In real images, the assumption that a sharp edge is a step function does not hold. 
Some (Ng et al. 2007) assume a sigmoid function, for simplicity, whereas we choose 
a small Gaussian kernel to approximate the authentic sharp (as) edge. 


Tas = f Ks (x) * Hstep(x)) (3.12) 


The IGH is the same as that of an authentic blur edge (3.11), with the size and c 
of the kernel being smaller. Because the blur extent is very small, there is a small 
transition edge region, and the effect of CRF will not be very obvious. The IGH 
remains symmetric shown as cyan in Fig. 3.6. 
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3.3.3 Forged Blurred Edge 
The model for a forged blurred (fb) edge is 


Tro = K,(x) * J (Astep(*)) 


The gradient of this is 


= d[K,G@) * J Astep(*))] 
_ dx 
= K,(x) * Lf’ (Hstep(x)) ý Fo)! 


VI 


Because 
f'(H,, Q) = (f' (b) — f'(a))u(x — c) + f'a) 
and 
JG): d(x) = FO) - d(x), 
we have 


V1 po = Kg(x) * | f'(b)óG@ — c)] 
= (b — a) f(b) Kg(x) * d(x — c) 
= (b—a) f'(b)K,(x — c). 
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(3.13) 


(3.14) 


Clearly, V I p, has the shape the same as the PSF kernel, which is symmetric w.r.t. 


the location of the step c, shown as red in Fig. 3.6. 


3.3.4 Forged Sharp Edge 


A forged sharp (fs) edge appears as an abrupt jump in intensity as 


Ifs = J (Astep(*)) 
The gradient of the spliced sharp boundary image is 


dl fs E dl f (Hstep(x))] 

dx dx 

= f'(Astep(x)) : (b — a)ó(x — c) 
= f'(b) -ôx — c) 


VIfs = 


(3.15) 
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There are only two intensities a and b, both having the same (large) gradient. The 
IGH of a forged sharp edge will only have all pixels fall in only two bins shown as 
magenta in Fig. 3.6. 


3.3.5 Distinguishing IGHs of the Edge Types 


To validate the theoretical analysis, we show IGHs of the four types of edges in 
Fig.3.7. An authentically blurred edge is blurred first, inducing intensity values 
between the low and high values in irradiance. This symmetric blur profile then is 
mapped through the CRF, becoming asymmetric. 

In the forged blur edge case, by contrast, the irradiance step edge is mapped by 
CRF first to a step edge in intensity space. Then the artificial blur (via PhotoShop, 
etc.) induces values between the two intensity extrema, and the profile reflects the 
symmetric shape of PSF. 

The forged sharp edge (from, for instance, a splicing operation) is an ideal step 
edge, whose nominal IGH has only two bins with non-zero counts. However, due 
to the existence of noise in images captured by cameras, shading, etc., the IGH of 
forgery sharp edge appears a rectangular shape as shown in Fig. 3.7. The horizontal 
line shows that the pixels fall into bins of different intensities with the same gradient, 
which is caused by the pixel intensity varying along the sharp edge. The two vertical 
lines show that the pixels fall into bins of different gradients with similar intensity 
values, which is caused by the pixels intensity varying on the constant color regions. 

As for the authentic sharp edge, the IGH is easily confused with a forged blurred 
edge, in that both are symmetric. If we only consider the shape of IGH, this would 
lead to a large number of false alarms. To disambiguate the two, we add an additional 
feature: the absolute value of the center intensity and gradient for each bin. This value 
helps due to the fact that at the same intensity value, the gradient of the blurred edge 
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Fig.3.7 The figure shows the four different edge classes and their IGH 
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is always smaller than the sharp edge. With our IGH feature, we are able to detect 
splicing boundaries that are hard for prior methods, such as spliced regions having 
constant color or those captured by the same camera. 


3.3.6 Classifying IGHs 


Having described the IGH and how these features differ between the four categories 
of edges, we now consider mechanisms to solve the inverse problem of inferring the 
edge category from an image patch containing an edge. Since our IGH is similar 
to a bag of words-type features used extensively in computer vision, SVMs are a 
natural classification mechanism to consider. In light of the recent success of powerful 
Convolutional Neural Networks (CNNs) applied directly to pixel arrays, we consider 
this approach, but find a limited performance that may be due to relatively sparse 
training data over all combinations of edge step height, blur extent, etc. To address 
this, we map the classification problem from the data-starved patch domain to a 
character shape recognition in the IGH domain, for which we leverage existing CNN 
architectures and training data. 


3.3.6.1 SVM Classification 


As in other vision applications, SVMs with the histogram intersection kernel (Maji 
et al. 2008) perform best for IGH classification. We unwrap our 2D IGH into a 
vector and append the center intensity and gradient value of each bin. Following the 
example of SVM-based classification on the bag of words features, we use a multi- 
classification scheme by training four one-vs-all models: authentic versus others, 
authentic sharp versus others, forged sharp versus others, and forged blur versus 
others. Then we combine the scores of all four models to classify the IGH as either 
authentic or a forgery. 


3.3.6.2 CNN on Edge Patches 


We evaluate whether a CNN applied directly to edge patches can out-perform meth- 
ods involving our IGH. We use the very popular Caffe (Jia et al. 2014) implementation 
for the classification task. 

We first train a CNN model on edge patches. In order to examine the difference 
between authentic and forged patches, we synthesized the authentic and forged blur 
processes on white, gray, and black edges as shown in Fig.3.8. Given the same 
irradiance map, we apply the same Gaussian blur and CRF mapping in both orderings. 
The white region in the authentically blurred image always appears to be larger than 
in the forgery blur image, because the CRF has a higher slope in the low-intensity 
region. That means all intensities of an authentically blurred edge are brought to the 
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Fig. 3.8 Authentic versus forgery images and edge profiles. The synthesized images are using the 
same CRF and Gaussian kernel. The real images are captured by Canon 70D. The forgery image is 
generated by blurring a sharp image the same amount to match the authentic image 


white point faster than a forgery. This effect can also be observed in real images, such 
as the Skype logo in Fig. 3.8, where the white letters in the real image are bolder than 
in the forgery. Another reason for this effect is that cameras have limited dynamic 
range. A forged sharp edge will appear like the irradiance map with step edges, while 
an authentically sharp edge would be easily confused with forgery blur since this 
CRF effect is only distinguishable with a relatively large amount of blur. Thus, the 
transition region around the edge potentially contains a cue for splicing detection in 
a CNN framework. 


3.3.6.3 CNN on IGHs 


Our final approach marries our IGH feature with the power of CNNSs. Usually, people 
wouldn’t use a histogram with CNN classification because spatial arrangements are 
important for other tasks, e.g., object recognition. But, as our analysis has shown, 
the cues relevant to our problem are found at the pixel level, and our IGH can be 
used to eliminate various nuisance factors: the orientation of the edge, the height of 
the step, etc. This has an advantage of reducing the training data dimensionality and, 
thus, the large number of training data needed to produce accurate models. Lacking 
an ImageNet-scale training set for forgeriesis a key advantage of our method. 

In the sense that a quantized IGH looks like various U-shaped characters, our 
approach reduces the problem of edge classification into a handwritten character 
recognition problem, which has been well studied (LeCun et al. 1998). Since we are 
only interested in the shape of the IGH, the LeNet model (LeCun et al. 1998) is very 
useful for the IGH recognition task. 
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3.3.7 Splicing Logo Dataset 


To validate our IGH-based classification approach, we built our own Splicing Logo 
Dataset (SpLogo) containing 1533 authentic images and 1277 forged blur images of 
logos with different colors and different amounts of blur. All the images are taken by 
a Canon70D. Optical blur is controlled via two different settings: one is by changing 
the focal plane (lens moving) and the other is by changing the aperture with the 
focal plane slightly off the logo plane. The logos are printed on a sheet of paper, 
and the focal plane is set to be parallel to the logo plane to eliminate the effect of 
depth-related defocus. Next, the digitally blurred images are generated to match the 
amount of optical blur through a optimization routine. 


3.3.8 Experiments Differentiating Optical and Digital Blur 


We use a histogram intersection kernel (Maji et al. 2008) for our SVM and LeNet 
(LeCun et al. 1998) for CNN witha le 5 base learning rate and lef maximum number 
of iterations. The training set contains 1000 authentic and 1000 forged patches. 

Table3.3 compares the accuracy of the different approaches described in 
Sect. 3.3.6. Somewhat surprisingly, the CNN applied directly to patches does the 
worst job of classification. Among the classification methods employing our hand- 
crafted feature, CNN classifiers obtain better results than SVM, and adding the abso- 
lute value of intensity and gradient to IGH increases classification accuracy. 


3.3.9 Conclusions: Differentiating Optical and Digital Blur 


These experiments show that, at a patch level, the IGH is an effective tool to detect 
digital blur around edge regions. How these patch-level detections relate to a higher- 
level forensics application will differ based on the objectives. For instance, to detect 
image splicing (with or without subsequent edge blurring), it would suffice to find 
strong evidence of one object with a digitally blurred edge. To detect globally blurred 
images, on the other hand, it would require evidence that all of the edges in the image 
are digitally blurred. In either case, the key to our detection method is that a non- 
linear CRF leads to differences in pixel-level image statistics depending on whether 


Table 3.3 Patch classification accuracy for SpLogo data and different classification approaches 
Classifier Image IGHnoV IGH 
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it is applied before or after blurring. Our IGH feature captures these statistics and 
provides a way to eliminate nuisance variables such as edge orientation and step 
height so that CNN methods can be applied despite a lack of a large training set. 


3.4 Additional Forensic Challenges from Computational 
Cameras 


Our experiments with blurred edges show that it's possible to differentiate blur cre- 
ated optically from blur added post-capture via signal processing. The key to doing 
so is understanding that the non-linearities of the CRF are not commutative with 
blur, which is modeled as a linear convolution. But what about the CRF itself? Our 
experiments involved imagery from a small number of devices, and past work (Hsu 
and Chang 2010) has shown that different devices have different CRF shapes. Does 
that mean that we need a training set of imagery from all different cameras, in order 
to be invariant to different CRFs? In subsequent work (Chen et al. 2019), we showed 
that CRFs from nearly 200 modern digital cameras had very similar shapes. While 
the CRFs between camera models differed enough to recognize the source camera 
when a calibration target (a color checker chart) was present, we later found that the 
small differences did not support reliable source camera recognition on ‘in the wild’ 
imagery. For forensic practitioners, the convergence of CRFs represents a mixed bag: 
their similarity makes IGH-like approaches more robust to different camera types, 
but reduces the value of CRFs as a cue to source camera model identification. 

Looking forward, computational cameras present additional challenges to foren- 
sics practitioners. As described elsewhere in this book, the development of deep 
neural networks that generate synthetic imagery is an important tool in the fight 
against misinformation. While Generative Adversarial Network (GAN)-based image 
detectors are quite effective at this point, they may become less so as computational 
cameras increasingly incorporate U-Net enhancements (Ronneberger et al. 2015) that 
are architecturally similar to GANs. It remains to be seen whether U-Net enhanced 
imagery will lead to increased false alarms from GAN-based image detectors, since 
the current generation of evaluation datasets don’t explicitly include such imagery. 

To the extent that PRNU and CFA-based forensics are important tools, new compu- 
tational imaging approaches challenge their effectiveness. Recent work from Google 
(Wronski et al. 2019), for instance, uses a tight coupling between multi-frame capture 
and image processing to remove the need for demosaicing and breaks the sensor-to- 
image pixel mapping that’s needed for PRNU to work. As these improvements make 
their way into mass market smartphone cameras, assumptions about the efficacy of 
classical forensic techniques will need to be re-assessed. 
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Chapter 4 ®) 
Sensor Fingerprints: Camera ta 
Identification and Beyond 


Matthias Kirchner 


Every imaging sensor introduces a certain amount of noise to the images it captures— 
slight fluctuations in the intensity of individual pixels even when the sensor plane 
was lit absolutely homogeneously. One of the breakthrough discoveries in multime- 
dia forensics is that photo-response non-uniformity (PRNU), a multiplicative noise 
component caused by inevitable variations in the manufacturing process of sensor 
elements, is essentially a sensor fingerprint that can be estimated from and detected 
in arbitrary images. This chapter reviews the rich body of literature on camera identi- 
fication from sensor noise fingerprints with an emphasis on still images from digital 
cameras and the evolving challenges in this domain. 


4.1 Introduction 


Sensor noise fingerprints have been a cornerstone of media forensics ever since Lukáš 
et al. (2005) observed that digital images can be traced back to their sensor based 
on unique noise characteristics. Minute manufacturing imperfections are believed to 
make every sensor physically unique, leading to the presence of a weak yet deter- 
ministic sensor pattern noise in images captured by the camera (Fridrich 2013). This 
fingerprint, commonly referred to as photo-response non-uniformity (PRNU), can be 
estimated from images captured by a specific camera for the purpose of source cam- 
era identification. As illustrated in Fig. 4.1, a noise signal is extracted in this process 
at test time from a probe image of unknown provenance, which can then be compared 
against pre-computed fingerprint estimates from a set of candidate cameras. 
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Fig. 4.1 Basic camera identification from PRNU-based sensor noise fingerprints (Fridrich 2013). 


(1) A camera fingerprint K is estimated from a number flatfield images taken by the camera of 
interest. (2) At test time, a noise residual W is extracted from a probe image of unknown provenance. 
(3) A detector decides whether or not the probe image originates from the camera of interest by 
evaluating a suitable similarity score, p 


Because PRNU emerges from physical noise-like properties of individual sensor 
elements, it has a number of attractive characteristics for source device attribution 
(Fridrich 2013). The fingerprint signal appears random and is of high dimensionality, 
which makes the probability of two sensors having the same fingerprint extremely 
low (Goljan et al. 2009). At the same time, it can be assumed that all common imaging 
sensor types exhibit PRNU, that each sensor output contains a PRNU component, 
except for completely dark or saturated images, and that PRNU fingerprints are stable 
over time (Lukás et al. 2006). Finally, various independent studies have found that 
the fingerprint is highly robust to common forms of post-processing, including lossy 
compression and filtering. 

The goal of this chapter is not to regurgitate the theoretical foundations of the 
subject at length, as these have been discussed coherently before, including in a 
dedicated chapter in the first edition of this book (Fridrich 2013). Instead, we hope 
to give readers an overview of the ongoing research and the evolving challenges in 
the domain while keeping the focus on conveying the important concepts. 

So why is it that there is still an abundance of active research when extensive 
empirical evidence (Goljan et al. 2009) has already established the feasibility of 
highly reliable PRNU-based consumer camera identification at scale? The simple 
answer is technological progress. Modern cameras, particularly those installed in 
smartphones, go to great lengths to produce visually appealing imagery. Imaging 
pipelines are ever evolving, and computational photography challenges our under- 
standing of what a “camera-original” image looks like. On the flip side, many of 
these new processing steps interfere with the underlying assumptions at the core 
of PRNU-based camera identification, which requires strictly that the probe image 
and the camera fingerprint are spatially aligned with respect to the camera sensor 
elements. Techniques such as lens distortion correction (Goljan and Fridrich 2012), 
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electronic image stabilization (Taspinar et al. 2016), or high dynamic range imag- 
ing (Shaya et al. 2018) have all been found to impede camera identification if not 
accounted for through spatial resynchronization. Robustness to low resolution and 
strong compression is another concern (van Houten and Geradts 2009; Chuang et al. 
2011; Goljan et al. 2016; Altinisik et al. 2020, i. a.) that has been gaining more 
and more practical relevance due the widespread sharing of visual media through 
online social networks (Amerini et al. 2017; Meij and Geradts 2018). At the same 
time, the remarkable success of PRNU-based camera identification has also surfaced 
concerns for the anonymity of photographers, who may become identifiable through 
the analysis and combination of information derived from one or multiple images. 
As a result, the desire to protect anonymous image communication, e.g., in the case 
of journalism, activism, or legitimate whistle-blowing, has brought counter-forensic 
techniques (Böhme and Kirchner 2013) to suppress traces of origin in digital images 
to the forefront. 

We discuss these and related topics with a specific focus on still camera images in 
more detail below. Our treatment of the subject here is complemented by dedicated 
chapters on video source attribution (Chap. 5) and large scale camera identifica- 
tion (Chap. 6). An overview of computational photography is given in Chap. 3. 
Sections 4.2 and 4.3 start with the basics of sensor noise fingerprint estimation and 
camera identification, before Sect.4.4 delves into the challenges related to spatial 
sensor misalignment. Section 4.5 focuses on recent advances in image manipulation 
localization based on PRNU fingerprints. Counter-forensic techniques for finger- 
print removal and copying are discussed in Sect. 4.6, while Sect. 4.7 highlights early 
attempts to apply tools from the field of deep learning to domain-specific problems. 
A brief overview of relevant public datasets in Sect.4.8 follows, before Sect. 4.9 
concludes the chapter. 


4.2 Sensor Noise Fingerprints 


State-of-the-art sensor noise forensics assumes a simplified imaging model for single- 
channel images I(m,n),0 <m < M, 0 < n < N, 


I=1(1+K)+8, (4.1) 


in which the multiplicative PRNU factor K modulates the noise-free image I, 
while 0 comprises a variety of additive noise components (Fridrich 2013). Ample 
empirical evidence suggests that signal K is a unique and robust camera fingerprint 
(Goljan et al. 2009). It can be estimated from a set of L images taken with the specific 
sensor of interest. The standard procedure relies on a denoising filter F (-) to obtain 
a noise residual, 

W=- Fd), (4.2) 


68 M. Kirchner 


from the /-th image L,, 0 < I < L. The filter acts mainly as a means to increase the 
signal-to-noise ratio between the signal of interest (the fingerprint) and the observed 
image. To date, most works still resort to the wavelet-based filter as adopted by Lukáš 
et al. (2006) for its efficiency and generally favorable performance, although a number 
of studies have found that alternative denoising algorithms can lead to moderate 
improvements (Amerini et al. 2009; Cortiana et al. 2011; Al-Ani and Khelifi 2017; 
Chierchia et al. 2014, i.a.). In general, it is accepted that noise residuals obtained 
from off-the-shelf filters are imperfect by nature and that they are contaminated non- 
trivially by remnants of image content. Salient textures or quantization noise may 
exacerbate the issue. For practical applications, a simplified modeling assumption 


W, = KI, +n, (4.3) 


with i.i.d. Gaussian noise 77, leads to a maximum likelihood estimate (MLE) K of 
the PRNU fingerprint of the form (Fridrich 2013) 


(4.4) 


In this procedure, it is assumed that all images I; are spatially aligned so that the pixel- 
wise operations are effectively carried out over the same physical sensor elements. 

If available, it is beneficial to use flatfield images for fingerprint estimation to 
minimize contamination from image content. The quality of the estimate in Eq. (4.4) 
generally improves with L, but a handful of homogeneously lit images typically 
suffices in practice. Uncompressed or raw sensor output is preferable over com- 
pressed images, as it will naturally reduce the strength of unwanted nuisance signals 
in Eq. (4.1). When working with raw output from a sensor with a color filter array 
(CFA), it is advisable to subsample the images based on the CFA layout (Simon et al. 
2009). 

Practical applications warrant a post-processing step to clean the fingerprint esti- 
mate from non-unique artifacts, which may otherwise increase the likelihood of false 
fingerprint matches. Such artifacts originate from common signal characteristics that 
occur consistently across various devices, for instance, due to the distinctively struc- 
tured layout of the CFA or block-based JPEG compression. It is thus strongly rec- 
ommended to subject K to zero-meaning and frequency-domain Wiener filtering, 
as detailed in Fridrich (2013). Additional post-processing operations have been dis- 
cussed in the literature (Li 2010; Kang et al. 2012; Lin and Li 2016, 1.a.), although 
their overall merit is often marginal (Al-Ani and Khelifi 2017). Other non-trivial 
non-unique artifacts have been documented as well. Cases of false source attribution 
linked to lens distortion correction have been reported when images from a different 
camera were captured at a focal length that was prominently featured during finger- 
print estimation (Goljan and Fridrich 2012; Gloe et al. 2012). Non-unique artifacts 
introduced by advanced image enhancement algorithms in modern devices are a 
subject of ongoing research (Baracchi et al. 2020; Iuliani et al. 2021). 
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For multi-channel images, the above procedures can be applied to each color 
channel individually before averaging the obtained signals into a single-channel 
fingerprint estimate (Fridrich 2013). 


43 Camera Identification 


For a given probe image I of unknown provenance, camera identification can be 
formulated as a hypothesis testing problem: 


Ho : W = I — F(D does not contain the fingerprint of interest, K 
H; : W does contain the fingerprint K ; 


i.e., the probe is attributed to the tested camera if H, holds. In practice, the test can 
be decided by evaluating a similarity measure p, 


p = sim(W, KI) 27/7. (4.5) 


for a suitable threshold 7. Under the modeling assumptions adopted in the literature 
(Fridrich 2013), the basic building block for this is the normalized cross-correlation 
(NCC), which is computed over a grid of shifts s = (s1, 52),0 < sı < M,0 < s2 < 
N, for two matrices A, B as 


XU (ACn, n) — A) (B(m + s1, n + s2) — B) 
|a -A| |B -B| 


NCCa p(s), s2) = (4.6) 


We assume implicitly that matrices A and B are of equal dimension and that zero- 
padding has been applied to assert this where necessary. In practice, the above expres- 
sion is evaluated efficiently in the frequency domain. It is common in the field to 
approximate Eq. (4.6) by working with the circular cross-correlation, i.e., by oper- 
ating on the FFTs of matrices A, B without additional zero-padding. Taking the 
maximum NCC over a set S of admissible shifts as similarity, 


pO = max NCCw xy(S) (4.7) 


can conveniently account for potential sensor misalignment between the tested fin- 
gerprint and the probe as they would result from different sensor resolutions and/or 
cropping. Peak-to-correlation energy (PCE) has been proposed as an alternative that 
mitigates the need for sensor-specific thresholds 


2 
fa IN W) (pce) (4.8) 
me Veg NCCY gy) 
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In the equation above, M denotes a small neighborhood around the peak NCC. 
It is commonly set to a size of 11 x 11 (Goljan et al. 2009). In practice, camera 
identification must account for the possibility of mismatching sensor orientations in 
the probe image and in the fingerprint estimate, so the test for the presence of the 
fingerprint should be repeated with one of the signals rotated by 180° if the initial 
test stays below the threshold. 

The choice of set S crucially impacts the characteristics of the detector, as a larger 
search grid will naturally increase the variance of detector responses on true negatives. 
If it can be ruled out a priori that the probe image underwent cropping, the “search” 
can be confined to S = {(0, 0)} to reduce the probability of false alarms. Interested 
readers will find a detailed error analysis for this scenario in Goljan et al. (2009), 
where the authors determined a false alarm rate of 2.4 x 1075 while attributing 
97.62% of probes to their correct source in experiments with more than one million 
images from several thousand devices. The PCE threshold for this operating point is 
60, which is now used widely in the field as a result. 

A number of variants of the core camera identification formulation can be 
addressed with only a little modification (Fridrich 2013). For instance, it can be of 
interest to determine whether two arbitrary images originate from the same device, 
without knowledge or assumptions about associated camera fingerprints (Goljan et al. 
2007). In a related scenario, the goal is to compare and match a number of finger- 
print estimates, which becomes particularly relevant in setups with the objective of 
clustering a large set of images by their source device (Bloy 2008; Li 2010; Amerini 
et al. 2014; Marra et al. 2017, 1.a.). 

Specific adaptations also exist for testing against large databases of camera finger- 
prints, where computational performance becomes a relevant dimension to monitor 
in practice. This includes efforts to reduce the dimensionality of camera fingerprints 
(Goljan et al. 2010; Bayram et al. 2012; Valsesia et al. 2015, i.a.) and protocols for 
efficient fingerprint search and matching (Bayram et al. 2015; Valsesia et al. 2015; 
Taspinar et al. 2020, i.a.). We refer the reader to Chap. 6 in this book which discusses 
these techniques in detail. 


4.4 Sensor Misalignment 


Camera identification from PRNU-based sensor fingerprints can only succeed if the 
probe image and the tested fingerprint are spatially aligned. While cross-correlation 
can readily account for translation and cropping, additional steps become inevitable 
if more general geometric transformations have to be considered. This includes, 
for instance, combinations of scaling and cropping when dealing with the variety of 
image resolutions and aspect ratios supported by modern devices (Goljan and Fridrich 
2008; Tandogan et al. 2019). Specifically, assuming that a probe image I underwent 
a geometric transform 7,.(-) with parameters u*, we want to carry out the above 
basic procedures on I — Ta D. If the parameters of the transform are unknown, 
the problem essentially translates into a search over a set U of admissible candidate 
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transforms u. In practice, the maximum similarity over all candidate transforms will 
determine whether or not an image contains the tested fingerprint, 


p = max sim (T'O -F (' D), KT; ' 0) 27 T, (4.9) 


and the detector threshold should be adjusted accordingly compared to the simpler 
case above to maintain a prescribed false alarm rate. Unfortunately, this search can 
quickly become computationally expensive. A number of approximations have been 
proposed to speed up the search (Goljan and Fridrich 2008), including applying 
the inverse transform to a precomputed noise residual W instead of recomputing 
the noise residual for each candidate transform, and evaluating sim(7, (WD, K) 
instead of sim(T-1 (W), KT !(1)).! As for the search itself, a coarse-to-fine grid 
search is recommended by Goljan and Fridrich (2008) for potentially scaled and 
cropped images, while Mandelli et al. (2020) adopt a particle swarm optimization 
technique. Gradient-based search methods generally do typically not apply to the 
problem due to the noise-like characteristics of detector responses outside of a very 
sharp peak around the correct candidate transform (Fridrich 2013). 

Sensor misalignment may not only be caused by “conventional” image processing. 
With camera manufacturers constantly striving to improve the visual image quality 
that their devices deliver to the customer, advanced in-camera processing and the 
rise of computational photography in modern imaging pipelines can pose significant 
challenges in that regard. One of the earliest realizations along those lines was that 
computational lens distortion correction can introduce non-linear spatial misalign- 
ment that needs to be accounted for (Goljan and Fridrich 2012; Gloe et al. 2012). 
Of particular interest is lens radial distortion, which lets straight lines in a scene 
appear curved in the captured image. This type of distortion is especially prominent 
for zoom lenses as they are commonly available for a wide variety of consumer 
cameras. For a good trade-off between lens size, cost and visual quality, cameras 
correct for lens radial distortion through warping the captured image according to 
a suitable compensation model that effectively inverts the nuisance warping intro- 
duced by the lens. This kind of post-processing will displace image content relative 
to the sensor elements. It impairs camera identification because the strength of lens 
radial distortion depends on the focal length, i.e., images taken by the same camera at 
different focal lengths will undergo different warping in the process of lens distortion 
correction. As a result, a probe image captured at a certain focal length that was not 
(well) represented in the estimation of the camera fingerprint Kin Eq. (4.4) may not 
be associated with its source device inadvertently. 

A simple first-order parametric model to describe and invert radially symmet- 
ric barrel/pincushion distortion facilitates camera identification via a coarse-to-fine 
search over a single parameter (Goljan and Fridrich 2012), similar to the handling 
of general geometric transformations in Eq. (4.9) above. The model makes a number 


' The two expressions are equivalent when used in combination with the PCE whenever it can be 
assumed that 7, 1(W)- Ty 1D ~ Ty l(wD. 
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of simplifying assumptions that may not always hold in practice to avoid a higher 
dimensional search space, including the assumed concurrence of the optical center 
and the image center. Crucially, the practical applicability of such an approach first 
and foremost depends on the validity of the assumed distortion correction model in 
real cameras. In most circumstances, it is ultimately not fully known which design 
choices camera manufacturers make, and we are not aware of published large-scale 
evaluations that span a significant number of different devices/lenses. There are inci- 
dental observations of missed detections that suggest deviations from a purely radial 
distortion correction model (Gloe et al. 2012; Goljan and Fridrich 2014), however. 

One of the reasons why interest in the effects of lens distortion correction and 
their remedies seems to have waned over the past years may be the enormous gain in 
the popularity of smartphones and similar camera-equipped mobile devices. These 
devices operate under very different optical and computational constraints than “con- 
ventional” consumer cameras and have introduced their own set of challenges to the 
field of PRNU-based camera identification. The source attribution of video data, 
and in particular the effects of electronic image stabilization (EIS), have arguably 
been the heavyweight in this research domain, and we direct readers to Chap. 5 
of this book for a dedicated exposition. In the context of this chapter, it suffices 
to say that EIS effectively introduces gradually varying spatial misalignment to 
sequences of video frames, which calls for especially efficient computational cor- 
rection approaches (Taspinar et al. 2016; Iuliani et al. 2019; Mandelli et al. 2020; 
Altinisik and Sencar 2021, 1.a.). 

Other types of in-camera processing are more and more moving into the focus 
as well. For instance, high dynamic range (HDR) imaging (Artusi et al. 2017) is 
routinely supported on many devices today and promises enhanced visual quality, 
especially also under challenging light conditions. In smartphones, it is usually real- 
ized by fusing a sequence of images of a scene, each taken at a different exposure 
setting in rapid succession. This requires registration of the images to mitigate global 
misalignment due to camera motion and local misalignment due to moving objects in 
the scene. In practice, the HDR image will be a content-dependent weighted mixture 
of the individual exposures, and different regions in the image may have undergone 
different geometric transformations. Without additional precautions, camera identifi- 
cation may fail under these conditions (Shaya et al. 2018). While it seems infeasible 
to express such complex locally varying geometric transformations in a paramet- 
ric model, it is possible to conduct a search over candidate transformation on local 
regions of the image. Empirical evidence from a handful of smartphones suggests 
that the contributions of the individual exposures can be locally synchronized via 
cross-correlation, after correcting for a global, possibly anisotropic rescaling oper- 
ation (Hosseini and Goljan 2019). Future research will have to determine whether 
such approach generalizes. 

Sensor misalignment can be expected to remain a practical challenge, and it is 
likely to take on new forms and shapes as imaging pipelines continue to evolve. We 
surmise that the nature of misalignments will broaden beyond purely geometrical 
characteristics with the continued rise of computational photography and camera 
manufacturers allowing app developers access to raw sensor measurements. Empir- 
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ical reports of impaired camera identification across mismatching imaging pipelines 
may be taken as first cautionary writing on the wall (Joshi et al. 2020). 


4.5 Image Manipulation Localization 


When a region of an image is replaced with content from elsewhere, the new content 
willlack the characteristic camera PRNU fingerprint one would expect to find other- 
wise. This is true irrespective of whether the inserted content has been copied from 
within the same image, or from a different image. Recasting camera identification as 
a local test for the presence of an expected fingerprint thus allows for the detection 
and localization of image manipulations (Chen et al. 2008; Fridrich 2013). 

A straightforward approach is to examine the probe image I by sliding an analysis 
window of size B x B over the probe image, and to assign a binary label Y(m, n) € 
{—1, 1}, 

Y(m,n) = sgn(p(m,n)— T), (4.10) 


to the window centered around location (m,n), with p(m, n) obtained by evaluat- 
ing Eq. (4.5) for the corresponding analysis window. The resulting binary map Y 
will then be indicative of local manipulations, with Y (m, n) = —1 corresponding to 
the absence of the fingerprint in the respective neighborhood. The literature mostly 
resorts to the normalized correlation for this purpose, p = p{{® . It can be computed 
efficiently in one sweep for all sliding windows by implementing the necessary sum- 
mations as linear filtering operations on the whole image. 

The localization of small manipulated regions warrants sufficiently small analysis 
windows, which impacts the ability to reliably establish whether or not the expected 
fingerprint is present negatively. The literature often finds a window size of B = 64 
as a reasonable trade-off between resolution and accuracy (Chierchia et al. 2014; 
Chakraborty and Kirchner 2017; Korus and Huang 2017). A core problem for analysis 
windows that small is that the measured local correlation under H, depends greatly on 
local image characteristics. One possible remedy is to formulate a camera-specific 
correlation predictor p(m,n) that uses local image characteristics to predict how 
strongly the noise residual in a particular analysis window is expected to correlate 
with the purported camera fingerprint under H, (Chen et al. 2008). The decision 
whether to declare the absence of the tested fingerprint can then be conditioned on 
the expected correlation. 

Adopting the rationale that more conservative decisions should be in place when 
the local correlation cannot be expected to take on large values per se, an adjusted 
binary labeling rule decides Y(m, n) = —1 iff p(m, n) < r and p(m, n) > À, where 
threshold À effectively bounds the probability of missed fingerprint detection under 
H. (Chen et al. 2008). To ensure that the binary localization map mostly contains 
connected regions of a minimum achievable size, a pruning and post-processing step 
with morphological filters is recommended. 
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(a) pristine image (b) manipulated image 


(c) local correlation p (d) binary localization map 


Fig.4.2 Image manipulation localization. The local correlation between the camera fingerprint and 
the noise residual extracted from the manipulated image is lowest (darker) in the manipulated region. 
It was computed from sliding windows of size 64 x 64. The binary localization map (overlayed on 
top of the manipulated image) was obtained with a conditional random field approach that evaluates 
the difference between measured and predicted correlation, p — > (Chakraborty and Kirchner 2017). 
Images taken from the Realistic Tampering Dataset (Korus and Huang 2017) 


Significant improvements have been reported when explicitly accounting for the 
observation that local decisions from neighboring sliding windows are interdepen- 
dent. A natural formulation follows from approaching the problem in a global opti- 
mization framework with the objective of finding the optimal mapping 


Y= arg max p (Yip, P) ` (4.11) 


This sets the stage for rewarding piecewise constant label maps via a variety of 
probabilistic graphical modeling techniques such as Markov random fields (Chierchia 
et al. 2014) and conditional random fields (Chakraborty and Kirchner 2017; Korus 
and Huang 2017). Figure 4.2 gives an example result. 

A general downside of working with fixed-sized sliding windows is that p(m, n) 
will naturally change only very gradually, which makes the detection of very small 
manipulated regions challenging (Chierchia et al. 2011). Multi-scale reasoning over 
various analysis window sizes can mitigate this to some extent (Korus and Huang 
2017). Favorable results have also been reported for approaches that incorporate 
image segmentation (Korus and Huang 2017; Zhang et al. 2019; Lin and Li 2020) or 
guided filtering (Chierchia et al. 2014) to adaptively adjust analysis windows based 
on image characteristics. 
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Surprisingly, very few notable updates to the seminal linear correlation predictor 
P from simple intensity, flatness, and texture features by Chen et al. (2008) have 
surfaced in the literature, despite its paramount role across virtually all PRNU-based 
localization approaches and a generally rather mixed performance (Quan and Li 
2021). We highlight here the replacement of the original linear regression model 
with a feed-forward neural network by Korus and Huang (2017), and the observation 
that a more accurate prediction can be achieved when the camera’s ISO speed is taken 
into account (Quan and Li 2021). Attempts to leverage a deep learning approach to 
obtain potentially more expressive features Chakraborty (2020) are commendable 
but require a more thorough evaluation for authoritative conclusions. 

Overall, image manipulation localization based on camera sensor noise has its 
place in the broader universe of digital media forensics when there is a strong prior 
belief that the probe image indeed originates from a specific camera. If the manipu- 
lated region is sufficiently small, this can be established through conventional full- 
frame camera identification. Non-trivial sensor misalignment as discussed in Sect. 4.4 
can be expected to complicate matters significantly for localization, but we are not 
aware of a principled examination of this aspect to date. 


4.6 Counter-Forensics 


The reliability and the robustness of camera identification based on sensor noise 
have been under scrutiny ever since seminal works on sensor noise forensics sur- 
faced over 15 years ago. As a result, it is widely accepted that PRNU fingerprints 
survive a variety of common post-processing operations, including JPEG compres- 
sion and resizing. Counter-forensics focuses on more deliberate attempts to impair 
successful camera identification by acknowledging that there are scenarios where 
intelligent actors make targeted efforts to induce a certain outcome of forensic anal- 
yses (Böhme and Kirchner 2013). In this context, it is instructive to distinguish 
between two major goals of countermeasures, fingerprint removal and fingerprint 
copying (Lukas et al. 2006; Gloe et al. 2007). The objective of fingerprint removal 
is the suppression of a camera’s fingerprint to render source identification impossi- 
ble. This can be desirable in efforts of protecting the anonymity of photographers, 
journalists, or legitimate whistleblowers in threatening environments (Nagaraja et al. 
2011). Fingerprint copying attempts to make an image plausibly appear as if it was 
captured by a different camera, which is typically associated with nefarious motives. 
It strictly implies the suppression of the original fingerprint and it is thus generally a 
harder problem. The success of such counter-forensic techniques is to a large degree 
bound by the admissible visual quality of the resulting image. If an image purports to 
be camera-original but has suffered from noticeable degradation, it will likely raise 
suspicion. If anonymity is of utmost priority, strong measures that go along with a 
severe loss of image resolution are more likely acceptable. 

Existing fingerprint removal methods can be categorized under two general 
approaches (Lukas et al. 2006). Methods of the first category are side-informed 
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(a) original image (b) modified image (c) close-up 


Fig. 4.3 Fingerprint removal with PatchMatch replaces local image content with visually similar 
content from elsewhere in the image (Entrieri and Kirchner 2016). The PCE, computed considering 
all possible cross-correlation shifts, decreases from 5,617 for the original image to 32 after the 
modification. The “anonymized” image has a PSNR of 38.3 dB. Original image size: 2,000 2,000 
pixels 


in the sense that they use an estimate of the sensor noise fingerprint to ensure a 
detector output below the identification threshold. Flatfielding—a denoising tech- 
nique that targets the general imaging model in Eq. (4.1)—is known to remove the 
multiplicative noise term K effectively, but it ideally requires access to the raw sensor 
measurements (Lukas et al. 2006; Gloe et al. 2007). Adaptive fingerprint removal 
techniques explicitly attempt to minimize Eq. (4.5) by finding a noise sequence that 
cancels out the multiplicative fingerprint term in Eq. (4.3) (Karaküçük and Dirik 
2015; Zeng et al. 2015). This works best when exact knowledge of the detector (and 
thus the images used to estimate K) is available. 

Uninformed techniques make less assumptions and directly address the robust- 
ness of the sensor noise fingerprint. Methods of this category apply post-processing 
to the image until the noise pattern is too corrupted to correlate with the finger- 
print. No specific knowledge of the camera, the camera’s fingerprint, or the detector 
is assumed in this process. However, the high robustness of the sensor fingerprint 
makes this a non-trivial problem (Rosenfeld and Sencar 2009), and solutions may 
often come with a more immediate loss of image quality compared to side-informed 
methods. One promising direction is to induce irreversible sensor misalignment. 
Seam-carving—a form of content-adaptive resizing that shrinks images by remov- 
ing low-energy “seams” (Avidan and Shamir 2007)—is an effective candidate oper- 
ation in this regard (Bayram et al. 2013), although a considerable amount of seams 
must be removed to successfully desynchronize the sensor fingerprint (Dirik et al. 
2014). This bears a high potential for the removal of “important” seams, degrading 
image quality and resolution. In many ways, recomposing the image entirely instead 
of removing a large number of seams is thus a more content-preserving alterna- 
tive. The idea here is to replace local content with content from elsewhere in the 
image with the objective of finding replacements that are as similar to the original 
content as possible while lacking the telltale portion of the fingerprint. A modified 
version of the PatchMatch algorithm (Barnes et al. 2009) has been demonstrated to 
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produce viable results (Entrieri and Kirchner 2016), while a later variant employed 
inpainting for this purpose (Mandelli et al. 2017). Figure 4.3 showcases an example 
of successful fingerprint removal with the PatchMatch algorithm. All three strate- 
gies, seam-carving, PatchMatch, and inpainting, can reliably prevent PRNU-based 
camera identification from a single image. An aggregation of fingerprint traces from 
multiple “anonymized” images from the same camera can reestablish the link to the 
common source device to some extent, however (Taspinar et al. 2017; Karaküçük 
and Dirik 2019). 

In a fingerprint copy attack, a nefarious actor Eve operates with the goal of making 
an arbitrary image J look as if it was captured by an innocent user Alice’s camera. 
Eve may obtain an estimate K z of Alice’s camera fingerprint from a set of publicly 
available images and leverage the multiplicative nature of the PRNU to obtain (Lukas 
et al. 2006) 

J =Jd+o0Kz), (4.12) 


where the scalar factor a > 0 determines the fingerprint strength. Attacks of this type 
have been demonstrated to be effective, in the sense that they can successfully mis- 
lead a camera identification algorithm in the form of Eq. (4.5). The attack’s success 
generally depends on a good choice of a: too low values mean that the bogus image 
J’ may not be assigned to Alice’s camera; a too strong embedding will make the 
image appear suspicious (Goljan et al. 2011; Marra et al. 2014). In practical scenar- 
ios, Eve may have to apply further processing to make her forgery more compelling, 
e.g., removing the genuine camera fingerprint, synthesizing color filter interpola- 
tion artifacts (Kirchner and Böhme 2009), and removing or adding traces of JPEG 
compression (Stamm and Liu 2011). 

Under realistic assumptions, it is virtually impossible to prevent Eve from forcing a 
high similarity score in Eq. (4.5). All is not lost, however. Alice can utilize that noise 
residuals computed with practical denoising filters are prone to contain remnants 
of image content. The key observation here is that the similarity between a noise 
residual Wy from an image I taken with Alice’s camera and the noise residual Wy 
due to a common attack-induced PRNU term will be further increased by some 
shared residual image content, if I contributed to Eve’s fingerprint estimate Ke. 
The so-called triangle test (Goljan et al. 2011) picks up on this observation by also 
considering the correlation between Alice’s own fingerprint and both Wy and Wy to 
determine whether the similarity between Wy with W y is suspiciously large. A pooled 
version of the test establishes whether any images in a given set of Alice’s images 
have contributed to Ki (Barni et al. 2018), without determining which. Observe that 
in either case Alice may have to examine the entirety of images ever made public 
by her. On Eve’s side, efforts to create a fingerprint estimate Kr ina procedure that 
deliberately suppresses telltale remnants of image content can thwart the triangle 
test’s success (Caldelli et al. 2011; Rao et al. 2013; Barni et al. 2018), thereby only 
setting the stage for the next iteration in the cat-and-mouse game between attacks 
and defenses. 

The potentially high computational (and logistical) burden and the security con- 
cerns around the triangle test can be evaded in a more constrained scenario. Specif- 
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ically, assume that Eve targets the forgery of an uncompressed image but only has 
access to images shared in a lossy compression format when estimating Ke. Here, it 
can be sufficient to test for the portion of the camera fingerprint that is fragile to lossy 
compression to establish that Eve’s image J’ does not contain a complete fingerprint 
(Quiring and Kirchner 2015). This works because the PRNU in uncompressed images 
is relatively uniform across the full frequency spectrum, whereas lossy compression 
mainly removes high-frequency information from images. As long as Alice’s public 
images underwent moderately strong compression, such as JPEG at a quality factor 
of about 85, no practicable remedies for Eve to recover the critically missing portion 
of her fingerprint estimate are known at the time of this writing (Quiring et al. 2019). 


4.7 Camera Fingerprints and Deep Learning 


The previous sections have hopefully given the reader the impression that research 
around PRNU-based camera fingerprints is very much alive and thriving, and that 
new (and old) challenges continue to spark the imagination of academics and prac- 
titioners alike. Different from the broader domain of media forensics, which is now 
routinely drawing on deep learning solutions, only a handful of works have made 
attempts to apply ideas from this rapidly evolving field to the set of problems typically 
discussed in the context of device-specific (PRNU-based) camera fingerprints. We 
can only surmise that this in part due to the robust theoretical foundations that have 
defined the field and that have ultimately led to the wide acceptance of PRNU-based 
camera fingerprints in practical forensic casework, law enforcement, and beyond. 
Data-driven “black-box” solutions may thus appear superfluous to many. 

However, one of the strengths that deep learning techniques can bring to the rigid 
framework of PRNU-based camera identification is in fact their very nature: they 
are data-driven. The detector in Sect.4.3 was originally derived under a specific 
set of assumptions with respect to the imaging model in Eq. (4.1) and the noise 
residuals in Eq. (4.3). There is good reason to assume that real images will deviate 
from these simplified models to some degree. First and foremost, noise residuals 
from a single image will always suffer from significant and non-trivial distortion, if 
we accept that content suppression is an ill-posed problem in the absence of viable 
image models (and possibly even of the noise characteristics itself (Masciopinto and 
Pérez-Gonzalez 2018)). This opens the door to potential improvements from data- 
driven approaches, which ultimately do not care about modeling assumptions but 
rather learn (hopefully) relevant insights from the training data directly. 

One such approach specifically focuses on the extraction of a camera signature 
from a single image I at test time (Kirchner and Johnson 2019). Instead of relying 
on a “blind” denoising procedure as it is conventionally the case, a convolutional 
neural network (CNN) can serve as a flexible non-linear optimization tool that learns 
how to obtain a better approximation of K. Seay: the network is trained to 
extract a noise pattern K to minimize IK — K|, as the pre-computed estimate K is 
the best available approximation of the actual PRNU signal under the given imaging 
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Fig.4.4 Camera identification from noise signals computed with a deep-learning based fingerprint 
extractor (coined “SPN-CNN”) (Kirchner and Johnson 2019). The ROC curves were obtained by 
thresholding the normalized correlation, o, between the extracted SPN-CNN noise patterns, 
K, and “conventional” camera fingerprints, K, for patches of size 100 x 100 (blue) and 50 x 50 
(orange). Device labels correspond to devices in the VISION (Shullani et al. 2017) (top) and Dresden 
Image Databases (Gloe and Böhme 2010) (bottom). Curves for the standard Wavelet denoiser and 
an off-the-shelf DnCNN denoiser (Zhang et al. 2017) are included for comparison 


model. The trained network replaces the denoiser F (-) at test time, and the fingerprint 
similarity is evaluated directly for K instead of KI in Eq. (4.5). Notably, this breaks 
with the tradition of employing the very same denoiser for both fingerprint estimation 
and detection. Empirical results suggest that the resulting noise signals have clear 
benefits over conventional noise residuals when used for camera identification, as 
showcased in Fig.4.4. A drawback of this approach is that it calls for extraction 
CNN's trained separately for each relevant camera (Kirchner and Johnson 2019). 

The similarity function employed by the detector in Eq. (4.5) is another target 
for potential improvement. If model assumptions do not hold, the normalized cross- 
correlation may no longer be a good approximation of the optimal detector (Fridrich 
2013), and a data-driven approach may be able to reach more conclusive decisions. 
Along those lines, a Siamese network structure trained to compare spatially aligned 
patches from a fingerprint estimate K and the noise residual W from a probe image I 
was reported to greatly outperform a conventional PCE-based detector across a range 
of image sizes (Mandelli et al. 2020). Different from the noise extraction approach 
by Kirchner and Johnson (2019), experimental results suggest that no device-specific 
training is necessary. Both approaches have not yet been subjected to more realistic 
settings that include sensor misalignment. 

While the two works discussed above focus on specific building blocks in the 
established camera identification pipeline, we are not currently aware of an end-to- 
end deep learning solution that achieves source attribution at the level of individual 
devices at scale. There is early evidence, however, to suggest that camera model 
traces derived from a deep model can be a beneficial addition to conventional camera 
fingerprints, especially when the size of the analyzed images is small (Cozzolino et al. 
2020). 
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CNNs have also been utilized in the context of counter-forensics, where the two- 
fold objective of minimizing fingerprint similarity while maximizing image quality 
almost naturally invite data-driven optimization solutions to the problem of side- 
informed fingerprint removal. An early proposal trains an auto-encoder inspired 
anonymization operator by encoding the stated objectives in a two-part cost function 
(Bonettini et al. 2018). The solution, which has to be retrained for each new image, 
relies on a learnable denoising filter as part of the network, which is used to extract 
noise residuals from the “anonymized” images during training. The trained network is 
highly effective at test time as long as the detector uses the same denoising function, 
but performance dwindles when a different denoising filter, such as the standard 
Wavelet-based approach, is used instead. This limits the practical applicability of the 
fingerprint removal technique for the time being. 

Overall, it seems almost inevitable that deep learning will take on a more promi- 
nent role across a wide range of problems in the field of (PRNU-based) device 
identification. We have included these first early works here mainly to highlight the 
trend, and we invite the reader to view them as important stepping stones for future 
developments to come. 


4.8 Public Datasets 


Various publicly available datasets have been compiled over time to advance research 
on sensor-based device identification, see Table 4.1. Not surprisingly, the evolution 
of datasets since the release of the trailblazing Dresden Image Database (Gloe and 
Böhme 2010) mirrors the general consumer shift from dedicated digital cameras 
to mobile devices. There are now also several diverse video datasets with a good 
coverage across different devices available. In general, a defining quality of a good 
dataset for source device identification is not only the number of available images 
per unique device, but also whether multiple instances of the same device model 
are represented. This is crucial for studying the effects of model-specific artifacts 
on false alarms, which may remain unidentified when only one device per camera 
model is present. Other factors worth considering may be whether the image/video 
capturing protocol controlled for certain exposure parameters, or whether all devices 
were used to capture the same set (or type) of scenes. 

Although it has become ever more easy to gather suitably large amounts of data 
straight from some of the popular online media sharing platforms, dedicated custom- 
made research datasets offer the benefit of a well-documented provenance while fos- 
tering the reproducibility of research and mitigating copyright concerns. In addition, 
many of these datasets include, by design, a set of flatfield images to facilitate the 
computation of high-quality sensor fingerprints. However, as we have seen through- 
out this chapter, a common challenge in this domain is to keep pace with the latest 
technological developments on the side of camera manufacturers. This translates 
into a continued need for updated datasets to maintain practical relevance and time- 
liness. Results obtained on an older dataset may not hold up well on data from newer 
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Table 4.1 Public datasets with a special focus on source device identification. The table lists the 
number of available images and videos per dataset, the number of unique devices and camera models 
covered, as well as the types of cameras included (C: consumer, D: DSLR, M: mobile device) 


Dataset Year Images Devices | Models | Cameras 
Dresden 2010 14k+ 73 25 CD 
Gloe and Böhme (2010)* 

RAISE 2015 8k+ 3 3 D 
Dang-Nguyen et al. (2015) 

VISION I 2017 llk+ 35 30 M 
Shullani et al. (2017) 

HDR 2018 5k+ 23 22 M 
Shaya et al. (2018) 

SOCRatES 2019 9k+ 103 60 M 
Galdi et al. (2019) 

video-ACID 2019 46 36 CDM 
Hosler et al. (2019) 

NYUAD-MMD 2020 6k+ 78 62 M 
Taspinar et al. (2020) 

Daxing 2020 43k+ 1k+ 90 22 M 
Tian et al. (2019) 

Warwick 2020 58k+ 14 11 CD 
Quan et al. (2020) 

Forchheim _ | 2020 3k+ 27 25 M 
Hadwiger and Riess (2020)"* 


“Includes multiple images of the same scene with varying exposure settings 
“Provides auxiliary data from sharing the base data on various social network sites 


devices due to novel sensor features or acquisition pipelines. A good example is 
the recent release of datasets with a specific focus on high dynamic range (HDR) 
imaging (Shaya et al. 2018; Quan et al. 2020), or the provision of annotations for 
videos that underwent electronic image stabilization (Shullani et al. 2017). With the 
vast majority of media now shared in significantly reduced resolution through online 
social network platforms or messaging apps, some of the more recent datasets also 
consider such common modern-day post-processing operations explicitly (Shullani 
et al. 2017; Hadwiger and Riess 2020). 


4.9 Concluding Remarks 


Camera-specific sensor noise fingerprints are a pillar of media forensics, and they are 
unrivaled when it comes to establishing source device attribution. While our focus 
in this chapter has been on still camera images (video data is covered in Chap.5 of 
this book), virtually all imaging sensors introduce the same kind of noise fingerprints 
as we discussed here. For example, line sensors in flatbed scanners received early 
attention (Gloe et al. 2007; Khanna et al. 2007), and recent years have also seen 
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biometric sensors move into the focus (Bartlow et al. 2009; Kauba et al. 2017; 
Ivanov and Baras 2017, 2019, i.a.). 

Although keeping track of advances by device manufacturers and novel imaging 
pipelines is crucial for maintaining this status, an active research community has 
so far always been able to adapt to new challenges. The field has come a long way 
over the past 15 years, and new developments such as the cautious cross-over into 
the world of deep learning promise a continued potential for fruitful exploration. 
New perspectives and insights may also arise from applications outside the realm of 
media forensics. For example, with smartphones as ubiquitous companions in our 
everyday life, proposals to utilize camera fingerprints as building blocks to multi- 
factor authentication let users actively provide their device’s sensor fingerprint via a 
captured image to be granted access to a web server (Valsesia et al. 2017; Quiring 
et al. 2019; Maier et al. 2020, i.a.). This poses a whole range of interesting practical 
challenges on its own, but it also invites a broader discussion about what camera 
fingerprints mean to an image-saturated world. At the minimum, concerns over image 
anonymity must be taken seriously in situations that call for it, and so we also see 
counter-forensics as part of the bigger picture unequivocally. 
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Chapter 5 A) 
Source Camera Attribution from Videos EE 


Husrev Taha Sencar 


Photo-response non-uniformity (PRNU) is an intrinsic characteristic of a digital 
imaging sensor, which manifests as a unique and permanent pattern introduced to 
all media captured by the sensor. The PRNU of a sensor has been proven to be a 
viable identifier for source attribution and has been successfully utilized for identi- 
fication and verification of the source of digital media. In the past decade, various 
approaches have been proposed for the reliable estimation, compact representation, 
and faster matching of PRNU patterns. These studies, however, mainly featured pho- 
tographic images, and the extension of source camera attribution capabilities to the 
video domain remains a problem. This is primarily because the steps involved in 
the generation of a video are much more disruptive to the PRNU pattern, and there- 
fore, its estimation from videos involves several additional challenges. This chapter 
examines these challenges and outlines recently developed methods for estimating 
the sensor’s PRNU from videos. 


5.1 Introduction 


The PRNU is caused by variations in size and material properties of the photosen- 
sitive elements that comprise a sensor. These essentially affect the response of each 
picture element under the same amount of illumination. Therefore, the process of 
identification boils down to quantifying the sensitivity of each picture element. This 
is realized through an estimation procedure using a set of pictures acquired by the 
sensor (Chen et al. 2008). To determine whether a media is captured by a given sensor, 
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the estimated sensitivity profile, i.e., the PRNU pattern, from the media in question 
is compared to a reference PRNU pattern obtained in advance using a correlation- 
based measure, most typically using the peak-to-correlation energy (PCE) (Vijaya 
Kumar 1990).! In essence, the reliability and accuracy of a source attribution deci- 
sion strongly depend on the fact that the PRNU-bearing raw signal at the sensor 
output has to pass through several steps of in-camera processing before the media 
is generated. These processing steps have a disruptive effect on the inherent PRNU 
pattern. 

The imaging sub-systems used by digital cameras largely remain proprietary to the 
manufacturer; therefore, it is quite difficult to access their details. At a higher level, 
however, the imaging pipeline in a camera includes various stages, such as acquisi- 
tion, pre-processing, color-processing, post-processing, and image and video coding. 
Figure 5.1 shows the basic processing steps involved in capturing a video, most of 
which are also utilized during the acquisition of a photograph. When generating a 
video, an indispensable processing step is the downsizing of the full-frame sensor 
output to reduce the amount of data that needs processing. This may be realized at 
acquisition by sub-sampling pixel readout data (trough binning Zhang et al. 2018 or 
skipping Guo et al. 2018 pixels), as well as during color-processing by downsampling 
color interpolated image data. Another key processing step employed by modern day 
cameras at the post-processing stage is image stabilization. Since many videos are 
not captured while the camera is in a fixed position, all modern cameras perform 
image stabilization to correct perspective distortions caused by handheld shooting or 
other vibrations. Stabilization can be followed by other post-processing steps, such 
as noise reduction and dynamic range adjustment, which may also include transfor- 
mations, resulting in further weakening of the PRNU pattern. Finally, the sequence of 
post-processed pictures is encoded into a standard video format for effective storage 
and transfer. 

Video generation involves at least three additional steps compared to the gener- 
ation of photos, including downsizing, stabilization, and video coding. When com- 
bined together, these operations have a significant adverse impact on PRNU esti- 
mation in two main respects. The first relates to geometric transformations applied 
during downsizing and stabilization, and the second concerns information loss largely 
caused by resolution reduction and compression. These detrimental effects could be 
further compounded by out-of-camera processing. Video editing tools today pro- 
vide a wide variety of sophisticated processing options over cameras as they are less 
bounded by computational resources and real-time constraints. Hence, the estimation 
of sensor PRNU from videos requires a thorough understanding of the implications 
of these operations and the development of effective mitigations against them. 


' Details on the estimation of the PRNU pattern from photographic images can be found in the 
previous two chapters of this book. 
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Fig. 5.1 The processing steps involved in the imaging pipeline of a camera when generating a 
video. The highlighted boxes are specific to video capture 


5.2 Challenges in Attributing Videos 


The now well-established PRNU estimation approach was devised considering in- 
camera processing relevant to the generation of photographic images. There are, 
however, several obstacles that prevent its direct application to the sequence of pic- 
tures that comprise a video. These difficulties can be summarized as follows. 

Video frames have lower resolution than the full-sensor resolution typically used 
for acquiring photos. Therefore, a reference PRNU pattern estimated from photos 
provides a more comprehensive characteristic of the sensor. Hence, the use of a 
photo-based reference pattern for attribution requires the mismatch with the dimen- 
sions of video frames to be taken into account. Alternatively, one might consider 
creating a separate reference PRNU pattern at all frame sizes. This, however, may 
not always be possible as videos can be captured at many resolutions (Tandogan 
et al. 2019). Moreover, in the case of smartphones, this may vary from one camera 
app to another based on the preference of developers among a set of supported reso- 
lutions. Overall, the downsizing operation in a camera involves various proprietary 
hardware and software mechanisms that involve both cropping and resizing opera- 
tions. Therefore, source attribution from videos requires the determination of those 
device-dependent downsizing parameters. When these parameters are not known a 
priori, PRNU estimation also has to incorporate the search for these parameters. 

The other challenge concerns the fact that observed PCE values in matching 
PRNU patterns are much lower in videos compared to photos. This decrease in PCE 
values is primarily caused by the downsizing operation and video compression. More 
critically, low PCE values are encountered even when the PRNU pattern is estimated 
from multiple frames of the video whose source is in question, i.e., reference-to- 
reference matching. When performing reference-to-reference matching, downsizing 
can be ignored as a factor as long as the resizing factor is higher than £ (Tandogan 
et al. 2019), and the compression is the main concern. In this setting, at medium- 
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to-low compression levels (2 Mbps to 600 Kbps coding bit rates), the average PCE 
value drops significantly as compression becomes more severe (where PCE values 
drop from around 2000 to 40) (Altinisik et al. 2020). Alternatively, when PRNU 
patterns from video frames are individually matched with the reference pattern (i.e., 
frame-to-reference matching), even downsizing by a factor of two causes a significant 
reduction in measured PCE values (Bondi et al. 2019). Frame-to-reference matching 
at HD-sized frames has been found to yield PCE values mostly around 20 (Altinisik 
and Sencar 2021). 

Correspondingly, a further issue concerns the difficulty of setting a decision thresh- 
old for matching. Large-scale tests performed on photographic images show that set- 
ting the PCE value to 60 as a threshold yields extremely low false matches when the 
correct-match rate is quite high. In contrast, as demonstrated in the results of earlier 
works, where decision thresholds of 40—100 (Iuliani et al. 2019) and 60 (Mandelli 
et al. 2020) are utilized when performing frame-to-reference matching, such thresh- 
old values on video frames yield much lower attribution rates. 

Another major challenge to video source attribution is posed by image stabiliza- 
tion. When performed electronically, stabilization requires determining and counter- 
acting undesired camera motion. This involves the application of frame or sub-frame 
level geometric transformations to align the content in successive pictures of a video. 
From the PRNU estimation standpoint, this requires registration of PRNU patterns 
by inverting stabilization transformations in a blind manner. This both increases 
the computational complexity of PRNU estimation and reduces accuracy due to an 
increase in false-positive matches generated during the search for unknown transfor- 
mation parameters. 

Although in-camera processing steps introduce artifacts that obstruct correct attri- 
bution, they can also be utilized to perform more reliable matching. For example, 
biases introduced to the PRNU estimate by the demosaicing operation and blockiness 
caused by compression are known to introduce periodic structures onto the estimated 
PRNU pattern. These artifacts can essentially be treated as pilot signals to detect clues 
about the processing history of a media after the acquisition. In the case of photos, 
the linear pattern associated with the demosaicing operation has been shown to be 
effective in determining the amount of shift, rotation, and translation, with weaker 
presence in newer cameras (Goljan 2018). In the case of videos, however, the linear 
pattern is observed to be even weaker, most likely due to the application of in-camera 
downsizing and more aggressive compression of video frames compared to photos. 
Therefore, it cannot be reliably utilized in identifying global or local transformations. 
In a similar manner, since video coding uses variable block sizes determined adap- 
tively during encoding, as opposed to a fixed block size used by still image coding, 
blockiness artifact is also not useful in reducing the computational complexity of 
determining the subsequent processing. 

Obviously, the most important factor that underlies these challenges is the inability 
to access details of the in-camera processing steps due to strong competition between 
manufacturers. In the lack of such knowledge and with continued advances in cam- 
era technology, research in this area has to treat in-camera processing largely as a 
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black-box and propose PRNU estimation methods that can cope with this ambigu- 
ity. As a result, the estimation of the PRNU from videos is likely to incur higher 
computational complexity and yield lower attribution accuracy. 


5.3 Attribution of Downsized Media 


Cameras capture images and videos at a variety of resolutions. This is typically 
performed to accommodate different viewing options and support different display 
and print qualities. However, downsizing becomes necessary when taking a burst of 
shots or capturing a video in order to reduce the amount of data that needs processing. 
Furthermore, to perform complex operations like image stabilization, typically, the 
sensor is cropped and only the center portion is converted into an image. 

The downsizing operation can be realized through different mechanisms. In the 
most common case, cameras capture data at full-sensor resolution which is converted 
into an image and then resampled in software to the target resolution. An alternative 
or precursor to sub-sampling is on-sensor binning or averaging which effectively 
obtains larger pixels. This may be performed either at the hardware level during the 
acquisition by electrically connecting neighboring pixels or by averaging pixel values 
digitally immediately after analog-to-digital conversion. Therefore, as a result of 
performing resizing, pixel binning, and sensor cropping, photos and videos captured 
at different resolutions are likely to yield non-matching PRNUs. 

The algorithm for how a camera performs downsizing of full-frame sensor output 
when capturing a photo or video is implemented as part of the in-camera processing 
pipeline and is proprietary to every camera maker. Therefore, the viable approach 
to understand downsizing behavior is to acquire media at different resolutions and 
compare the resulting PRNUs to determine the best method for matching PRNU 
patterns at different resolutions. 

Given that smartphones and tablets have become the defacto camera today, an 
important concern is whether the use of the default camera interface, i.e., the native 
camera app, to capture photos and videos only exposes certain default settings and 
does not reveal all possible options that may potentially be selected by other camera 
apps that utilize the built-in camera hardware in a device. 

To address this limitation, Tandogan et al. (2019) used a custom camera app 
to discover the features of 21 Android-based smartphone cameras. This approach 
also allowed direct control over various camera settings, such as the application of 
electronic zoom, deployment of video stabilization, compression, and use of high 
dynamic range (HDR) imaging, which may interfere with the sought after downsizing 
behavior. Most notably, it was determined that cameras can capture photos at 15— 
30 resolutions and videos at 10—20 resolutions, with newer cameras offering more 
options. A great majority of these resolutions were in 4:3 and 16:9 aspect ratios. 
To determine the active sensor pixels used during video acquisition, videos were 
captured at all supported frame resolutions and a photo was taken while recording 
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continued. This setting led to the finding that each photo is taken at the highest reso- 
lution of the same aspect ratio, which possibly indicates that frames are downscaled 
by a fixed factor before being encoded into a video. 

The maximum resolution of a video is typically much smaller than the full-sensor 
size. Therefore, a larger amount of cropping and scaling has to be performed on video 
frames. Hence, the two most important questions that relate to in-camera downsizing 
are about its impact on the reliability of PRNU matching and how matching should 
be performed when two PRNU patterns are of different sizes. 


5.3.1 The Effect of In-Camera Downsizing on PRNU 


Investigating the effects of downsizing on camera ID verification, (Tandogan et al. 
2019) determined that an important consideration is knowledge of the specifics of 
the downsizing operation. When downscaling is performed using the well-known 
bilinear, bicubic, and Lanczos interpolation filters, the resulting PCE values for all 
resizing factors higher than 5 (which reduces the number of pixels in the original 
image by a factor of 12°) is found to yield PCE values higher than the commonly 
used decision threshold of 60. However, when there is a mismatch between the 
method used for downscaling the reference PRNU pattern and the downsizing method 
deployed by a camera, the matching performance dropped very fast. In this setting, 
even at the scaling factor of i estimated PRNUs could not be matched. Hence, it can 
be deduced that when the underlying method is known, downsizing does not pose a 
significant obstacle to source attribution despite significant data loss. When it is not 
known, however, in-camera downsizing is disruptive to successful attribution even 
at relatively low downsizing factors. 


5.3.2 Media with Mismatching Resolutions 


The ability to reliably match PRNU patterns extracted from media at different resolu- 
tions depends on knowledge of how downsizing is performed. In essence, this reduces 
to determining the amount of cropping and scaling applied to media at lower reso- 
lution. Although implementation details of these operations may remain unknown, 
the area and position of cropping and an approximate scaling factor can be identi- 
fied through a search. An important concern is whether the search for downsizing 
parameters should be performed in the spatial domain or the PRNU domain. 

Since downsizing is performed in the spatial domain, the search has to be ideally 
performed in the spatial domain where identified parameters are validated based on 
the match of the estimated PRNU with the reference PRNU. In this search setting, 
each trial for the parameters in the spatial domain has to be followed by PRNU 
estimation. As downsizing parameters are determined through a brute-force search 
and the search space for the parameters can be large, this may be computationally 
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expensive. This is indeed a greater concern when it comes to inverting stabilization 
transformations because the search has to be performed for each frame individually 
rather than once for the camera model. Hence, an alternative approach is to perform 
the search directly using the PRNU patterns, removing the need for repetitive PRNU 
estimation. However, since the PRNU estimation process is not linear in nature, this 
change in the search domain is likely to introduce a degradation in performance. 
To determine this, Altinisik et al. (2021) applied random transformations to a set of 
video frames, estimated PRNU patterns, inverted the transformation, and evaluated 
the match with the corresponding reference patterns. The resulting PCE values when 
compared to performing transformation and inversion in the spatial domain (to also 
take into account the effect of transformation-related interpolation) showed that the 
search of parameters in the PRNU domain yielded mostly acceptable results. 

Source attribution under geometric transformations has been previously studied. 
Considering scaled and cropped photographic images, Goljan et al. (2008) demon- 
strated that through a search of transformation parameters, the alignment between 
PRNU patterns can be re-established. For this, the PRNU pattern obtained from the 
image in question is upsampled in discrete steps and matched with the reference 
PRNU at all shifts. The parameter values that yield the highest PCE are identified 
as the correct scaling factor and the cropping position. In a similar manner, a PRNU 
pattern extracted from a higher resolution video can be compared with PRNU pat- 
terns extracted from low-resolution videos. With this goal, Refs. Tandogan et al. 
(2019), Iuliani et al. (2019), and Taspinar et al. (2020) examined in-camera downsiz- 
ing behavior to enable reliable matching of PRNU patterns at different resolutions. 
Essentially to determine how scaling and cropping are performed, the high-resolution 
PRNU pattern is downscaled and a search is performed to determine cropping bound- 
aries by computing the normalized cross-correlation (NCC) between the two patterns. 
This process is repeated by downsizing the high-resolution pattern in decrements and 
performing the search until the alignment is established. Alternatively, Bellavia et al. 
(2021) proposed a content alignment approach. That is, instead of aligning PRNU 
patterns extracted from photos and pictures of a video captured independently, they 
used pictures and videos of a static scene and estimated the scaling and cropping 
parameters through keypoint registration. 

The presence of electronic stabilization poses two complications to this process. 
First, since stabilization transformations vary from frame to frame, the above search 
may at best (i.e., only when stabilization transformation is in the form of a frame level 
affine transformation) identify the combined effect, and downsizing-related cropping 
and scaling can only be approximately determined. Second, the search has to be 
performed at the frame level (frame-to-reference matching) where PRNU strength 
will be significantly weaker as opposed to using a reference pattern obtained from 
multiple frames (reference-to-reference matching). Therefore, the process will be 
more error prone. Ultimately, since downsizing-related parameters (i.e., scaling ratio 
and cropping boundaries) vary at the camera model level, the above approach can be 
used to generate a dictionary of parameters that specify how in-camera downsizing is 
performed using videos acquired under controlled conditions to prevent interference 
from stabilization. 
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5.4 Mitigation of Video Coding Artifacts 


As a raw video is a sequence of images, its size is impractically large to store or 
transfer. Video compression exploits the fact that frames in a video sequence are 
highly correlated in time and aims to reduce the spatial and temporal redundancy 
so that as few bits as possible are used to represent the video sequence. Conse- 
quently, compression-related information loss causes much more severe artifacts in 
videos compared to those in photos. Currently, the most prevalent and popular video 
compression standards are H.264/AVC and its recent descendent H.265/HEVC. 

The ability to cope with video coding artifacts crucially requires addressing two 
problems. The first relates to the disruptive effects of a filtering operation, i.e., the 
loop filtering, deployed by video codecs to suppress artifacts arising due to block- 
based operation of video coding standards. Proposed solutions to cope with this 
have either tried to compensate for such effects in post-processing or to eliminate 
its effects by intervening in the decoding process. The second problem concerns 
compression-related information loss. Several approaches have been proposed to 
combat the adverse effects of compression by incorporating frame and block level 
encoding information to the PRNU estimation. 


5.4.1 Video Coding from Attribution Perspective 


The widely used PRNU estimation method tailored for photographic images in fact 
takes into account compression artifacts. To combat this, the Fourier domain rep- 
resentation of the PRNU pattern is Wiener filtered to remove all structural patterns 
such as those caused by blockiness. However, video coding has several important dif- 
ferences from still image coding (Richardson 2011; Sze et al. 2014), and the PRNU 
estimation approach for videos should take these differences into account. 

Block-based operation: Both H.264 and H.265 video compression standards oper- 
ate by dividing a picture into smaller blocks. Each coding standard has different 
naming and size limits for those blocks. In H.264, the largest blocks are called mac- 
roblocks and they are 16 x 16 pixels in size. A macroblock can be further divided 
into smaller sub-macroblock regions that are as small as 4 x 4 pixels. H.265 format 
is an improved version of H.264, and it is designed with parallel processing in mind. 
The sizes of container blocks in H.265 can vary from 64 x 64 to 16 x 16 pixels. 
These container blocks can be recursively divided into smaller blocks (down to the 
size of 8 x 8 pixels) which may be further divided into sub-blocks of 4 x 4 pixels 
during prediction. For the sake of brevity, we refer to all block types as macroblocks 
even though the term macroblock no longer exists in the H.265 standard. 

Frame types: A frame in an H.264 or H.265 video can have three types: I, P, and 
B. An I frame is self-contained and temporally independent. It can be likened to a 
JPEG image in this sense. However, they are quite different in many aspects including 
variable block sizes, use of prediction, integer transformation derived from a form 
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of discrete sine transform (DST), variable quantization, and deployment of a loop 
filter. The macroblocks in P frame can use previous I or P frame macroblocks to find 
the best prediction. Encoding of a B frame provides the same flexibility. Moreover, 
future I and P frames can be utilized when predicting B frame macroblocks. A video 
is composed of a sequence of frames that may exhibit a fixed or varying pattern of I, 
P, and B frames. 

Block types: Similar to frames, macroblocks can be categorized into three types, 
namely, I, P, and B. An I macroblock is intra-coded with no dependence on future or 
previous frames, and I frames only contain this type of block. A P macroblock can 
be predicted from a previous P or I frame, and P frames can contain both types of 
blocks. Finally, a B macroblock can be predicted using one or two frames of I or P 
types. A B frame can contain all three types of blocks. If a block happens to have 
the same motion vectors as its preceding block and few or no residuals, this block 
is skipped since it can be constructed at the decoder side using the preceding block 
information. This type of block has the inherent properties of its preceding blocks 
and they are called skipped blocks. Skipped blocks might appear in P and B frames 
only. 

Rate-Distortion optimization: During encoding, each picture block is predicted 
either from the neighboring blocks of the same frame (intra-prediction) or from the 
blocks of the past or future frames (inter-prediction). Therefore, each coded block 
contains the position information of the reference picture block (motion vector) and 
the difference between the current block and the reference block (residual). This 
residual matrix is first transformed to the frequency domain and then quantized. This 
is also the stage where information loss takes place. 

There is ultimately a trade-off between the amount of data used for representing a 
picture (data rate) and the quality of the reconstructed picture. Finding an optimum 
balance has been a research topic for a long time in video coding. The approach taken 
by the H.264 standard is based on rate-distortion optimization (RDO). According to 
the RDO algorithm, the relationship between the rate and distortion is defined as 


J=D+AR, (5.1) 


where D is the distortion introduced to a picture block, R is the number of bits 
required for its coding, is a Lagrangian multiplier computed as a function of the 
quantization parameter Q P, and J is the rate-distortion (RD) value to be minimized. 
The value of J obviously depends on several coding parameters such as block type (I, 
P or B), intra-prediction mode (samples used for extrapolation), the number of pair 
of motion vectors, sub-block size (4 x 4, 8 x 4, etc.), and the Q P. Each parameter 
combination is called coding mode of the block, and the encoder picks the mode m 
that yields the minimum J as given in 


m = argmin Jj, (5.2) 


l 
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where the index i ranges over all possible coding modes. It must be noted that the 
selection for the optimum mode in Eq. (5.2) is performed on the encoder side. That 
is, the encoder knows all variables involved in the computation of Eq. (5.1). The 
decoder, in contrast, has only the knowledge of the block type, the rate (R), and the 
À value and is oblivious to D and J values. 

Overall due to the RDO, at one end of the distortion spectrum are the intra-coded 
blocks which undergo light distortion and yield the highest number of bits, and at the 
other end are the skipped blocks which have the same motion vectors and parameters 
(i.e., the quantization parameter and the quantized transform coefficients) as their 
spatially preceding blocks, thereby requiring only a few bits to encode at the expense 
of a higher distortion. 

Quantization: In video coding, the transformation and quantization steps are 
designed to minimize computational complexity so that they can be performed using 
limited-precision integer arithmetic. Therefore, an integer transform is used, which is 
then merged with quantization in a single step so that it can be realized efficiently on 
hardware. As a result of this optimization, the actual quantization step size cannot be 
selected by the user but is controlled indirectly as a function of the above-stated quan- 
tization parameter, Q P. In practice, the user mostly decides on the bitrate of coded 
video and the encoder varies the quantization parameter so that the intended rate can 
be achieved. Thus, the quantization parameters might change several times when 
coding a video. Moreover, a uniform quantizer step is applied to every coefficient in 
a4 x 4 or 8 x 8 block by default. However, frequency-dependent quantization can 
also be performed depending on the selected compression profile. 

Loop filtering: Block-wise quantization performed during encoding yields a 
blockiness effect on the output pictures. Both coding standards deploy an in-loop 
filtering operation to suppress this artifact by applying a spatially variant low pass 
filter to smooth the block boundaries. The strength of the filter and the number of 
pixels affected by filtering depend on several constraints such as being at the bound- 
ary of a macroblock, the amount of quantization applied, and the gradient of image 
samples across the boundary. It must be noted that the loop filter is also deployed 
during encoding to ensure synchronization between encoder and decoder. In essence, 
the decoder reconstructs each macroblock by adding a residual signal to a reference 
block identified by the encoder during prediction. In the decoder, however, all the 
previously reconstructed blocks would have been filtered. To not create such a dis- 
crepancy, the encoder also performs prediction assuming filtered block data rather 
than using the original macroblocks. The only exception to this is the intra-frame 
prediction where extrapolated neighboring pixels are taken from an unfiltered block. 

Overall, the reliability of an estimated PRNU pattern from a picture of a video 
depends on the amount of distortion introduced during coding. This distortion is 
mainly induced by the RDO algorithm at the block level and by the loop filtering at 
block boundaries depending on the strength of compression. Therefore, the disruptive 
effects of these operations must be countered. 
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5.4.2 Compensation of Loop Filtering 


The strength of the blocking effect at the block boundaries of a video frame is crucially 
determined by the level of compression. In high-quality videos, such an artifact will 
be less apparent, so from a PRNU estimation point of view, the presence of loop 
filtering will be of less concern. However, for low-quality videos, the application of 
loop filtering will result in the removal of a significant portion of the PRNU pattern, 
thereby making source attribution less reliable (Tan et al. 2016). 

Earlier research explored this problem by focusing on the removal of blocking 
and ringing noise (Chen et al. 2007). Assuming most codecs use fixed 16 x 16 
sized macroblocks, a frequency domain method was proposed to remove PRNU 
components that exhibit periodicity. To better suppress compression artifacts, Hyun 
et al. (2012) proposed an alternative approach utilizing a minimum average cor- 
relation energy (MACE) filter which essentially minimizes the average correlation 
plane energy, thereby yielding higher PCE values. By applying a MACE filter to the 
estimated pattern, a 10% improvement in identification accuracy is compared to the 
results of Chen et al. (2007). Later, Altinisik et al. (2018) took a proactive approach 
that involves intervention in the decoding process as opposed to performing post- 
estimation processing. 

A simplified decoding schema for H.264 and H.265 codecs is given in Fig. 5.2. 
In this model, the coded bit sequence is run through entropy decoding, dequantiza- 
tion, and inverse transformation steps before the residual block and motion vectors 
are obtained. Finally, video frames are reconstructed through the recursive motion 
compensation loop. The loop filtering is the last processing step before the visual 
presentation of a frame. Despite the fact that the loop filter is primarily related to 
the decoder side, it is also deployed on the encoder side to be compatible with the 
decoder. On the encoder side, a decoded picture buffer is used for storing reference 
frames and those pictures are all loop filtered. Inter-prediction is carried out on loop 
filtered pictures. 

It can be seen in Fig. 5.2 that for intra-prediction, unfiltered blocks are used. Since 
I type macroblocks that appear in all frames are coded using intra-frame prediction, 
bypassing the loop filter during decoding will not introduce any complications. How- 
ever, for P and B type macroblocks since the encoder performs prediction assuming 
filtered blocks, simply removing the loop filter at the decoder will not yield a cor- 
rect reconstruction. To compensate for this behavior that affects P and B frames, 
the decoding process must be modified to reconstruct both filtered and non-filtered 
versions of each macroblock. Filtered macroblocks must be used for reconstructing 
future macroblocks, and non-filtered ones need to be used for PRNU estimation.” 

It must be noted that due to this modification, video frames will now exhibit 
blocking artifacts. However, since (sub-)macroblock sizes are not deterministic and 
partitioning of macroblocks into smaller blocks is mainly decided by the RDO algo- 


2 An implementation of loop filter compensation capability for H.264 and H.265 decoding 
modules in the FFMPEG libraries can be obtained at https://github.com/VideoPRNUExtractor/ 
LoopFilterCompensation. 
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Fig. 5.2 Simplified schema of decoding for H.264 and H.265 codecs. In the block diagram, for 
H.264, the filtering block represents the deblocking loop filter and for H.265, it involves a similar 
deblocking filter cascaded with the sample adaptive offset filter 


rithm, PRNU estimates containing blocking artifacts are very unlikely to exhibit a 
structure that is persistent across many consecutive frames. Hence, estimation of 
a reference PRNU pattern from several video frames will suppress these blocking 
artifacts through averaging. 


5.4.3 Coping with Quantization-Related Weakening of PRNU 


The quantization operation applied to prediction error for picture blocks is the key 
contributing factor to the weakening of the underlying PRNU pattern. A relative com- 
parison between the effects of video and still image compression on PRNU estimation 
is provided in Altinisik et al. (2020) in terms of the change in the mean squared error 
(MSE) and PCE values computed between a set of original and compressed images. 
Table 5.1 shows the compression parameter pairs, i.e., QP for H.264 and QF for 
the JPEG, which yield similar MSE and PCE values. It can be seen that H.264 cod- 
ing, in comparison to JPEG coding, maintains a relatively high image quality, despite 
rapidly weakening the inherent PRNU pattern. Essentially, these measurements show 
that what would be considered medium level in video compression is equivalent to 
heavy JPEG compression. This essentially indicates why the conventional method 
of PRNU estimation is not very effective on videos. 

Several studies have proposed ways to tackle this problem. In van Houten and 
Geradts (2009), the authors studied the effectiveness of the PRNU estimation and 
detection process on singly and multiply compressed videos. For this purpose, videos 
captured by webcams were compressed using a variety of codecs at varying reso- 
lutions and matched with the reference PRNU patterns estimated from raw videos. 
The results of that study highlight the dependence on the knowledge of encoding 
settings for successful attribution and the non-linear relationship between quality 
and the detection statistic. 

Chuang et al. (2011) explored the use of different types of frames in PRNU esti- 
mation and found that I frames yield more reliable PRNU patterns than predicted P 
and B frames and suggested giving a higher weight to the contribution of I frames 
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Table 5.1 A comparative evaluation of JPEG and H.264 encoding in terms of MSE and PCE 


Equalized MSE Equalized PCE 

QFIPEG Q PH.264 QFIPEG Q PH.264 
92 5 92 5 
92 10 90 10 
91 15 84 15 
90 18 TI 18 
89 21 66 21 
87 24 50 24 
82 27 33 27 
TI 30 22 30 
67 35 10 35 
61 40 3 40 
57 45 1 45 
57 50 1 50 


during estimation. In Altinisik et al. (2018), macroblock level compression informa- 
tion is incorporated into estimation by masking out heavily compressed blocks. A 
similar masking approach was proposed by Kouokam et al. (2019) in which blocks 
that lack high-frequency content in the prediction error are eliminated from PRNU 
estimation. Later, Altinisik et al. (2020) examined the relationship between the quan- 
tization parameter of a block and the reliability of the PRNU pattern to introduce a 
block level weighting scheme that adjusts the contribution of each block accordingly. 
To better utilize block level parameters, the same authors in Altinisik et al. (2021) 
considered different PRNU weighting schemes based on encoding block type and 
coding rate of a block to provide a more reliable estimation of the PRNU pattern. 
Since encoding is performed at the block level, its weakening effect on the PRNU 
pattern must also be compensated at the block level by essentially weighting the 
contribution of each block in accordance with the level of distortion it has under- 
gone. It must be emphasized here that the block-based PRNU weighting approach 
disregards the picture content and only utilizes decoding parameters. Hence, other 
approaches proposed to enhance the strength of PRNU by suppressing content inter- 
ference (Li 2010; Kang et al. 2012; Lin and Li 2016) and improving the denoising 
performance (Al-Ani et al. 2015; Zeng and Kang 2016; Kirchner and Johnson 2019) 
are complementary to it. The PRNU weighting approach can be incorporated into 
the conventional PRNU estimation method (Chen et al. 2008) by simply introducing 
a block-wise mask for each picture. When obtaining a reference pattern from an 
unstabilized (or very lightly stabilized) video, this can be formulated as 


g = Èi li x Wi x Mi 5.3 
Nd? x M. ` = 
i=1 Vi i 
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where K is the sensor-specific PRNU factor computed using a set of video pictures 
I,,..., Iy; Wi represents the noise residue obtained after denoising picture J;; and 
M; is a mask to appropriately weigh the contribution of each block based on different 
block level parameters. It must be noted that the weights that comprise the mask can 
take values higher than one; therefore, to prevent a blow-up, the mask term is also 
included in the denominator as a normalization factor. When the video is stabilized, 
however, Eq. (5.3) cannot be used since each picture would have been subjected to a 
different stabilization transformation. Therefore, each frame needs to be first inverse 
transformed and Eq. (5.3) must be evaluated only after applying the identified inverse 
transformation to both J; and the mask M;. 

Ultimately, the distortion induced to a block as a result of the RDO serves as 
the best estimator for the strength of the PRNU. This distortion term, however, is 
unknown at the decoder. Therefore, when devising a PRNU weighting scheme, avail- 
able block level coding parameters must be used. This essentially leads to several 
weighting schemes based on masking out skipped blocks, the quantization strength, 
and the coding bitrate of a block. Developing a formulation that relates these block 
level parameters directly to the strength of the PRNU is in general analytically 
intractable. Therefore, one needs to rely on measurements obtained by changing 
coding parameters in a controlled manner and determine how the estimated PRNU 
pattern affects the accuracy of the attribution. 

Eliminating Skipped Blocks: Skip blocks do not transmit any block residual 
information. Hence, they provide larger bitrate savings at the expense of a higher 
distortion when compared to other block types. As a consequence, the PRNU pattern 
extracted from skipped blocks will be the least reliable. Nevertheless, the distortion 
introduced by substituting a skip block with a reference block cannot be arbitrarily 
high as it is bounded by the RDO algorithm, Eq. (5.1). This weighting scheme, 
therefore, deploys a binary mask M where all pixel locations corresponding to a 
skipped block are set to zero and those corresponding to coded blocks are assigned 
a value of one. A major concern with the use of this scheme is that at lower bitrates, 
skip blocks are expected to be much more frequently encountered than other types 
of blocks, thereby potentially leaving very few blocks to reliably estimate a PRNU 
pattern. The rate of the skipped blocks in a video, i.e., the ratio of the number of 
skipped blocks to the total number of blocks, depending on the coding bitrate is 
investigated in Altinisik et al. (2021). Accordingly, it is determined that for videos 
with a bitrate lower than 1 Mbps, more than 70% of the blocks are skipped during 
coding. This finding overall indicates that unless the video is extremely short in 
duration or encoded at a very low bitrate, elimination of skipped blocks will not 
significantly impair the PRNU estimation. 

Quantization-Based PRNU Weighting: Quantization of the prediction residue 
associated with a block is the main reason behind the weakening of the PRNU pat- 
tern during compression. To exploit this, Altinisik et al. (2020) used the strength of 
quantization as the basis of the weighting scheme and empirically obtained a relation 
between the quantization parameter Q P and the PCE, which reflects the reliability of 
the estimated PRNU pattern. The underlying relationship is obtained by re-encoding 
a set of high-quality videos at all possible, i.e., 51, QP levels. Then, average PCE 
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values between video pictures coded at different Q P values and the camera reference 
pattern are computed separately for each video. To suppress the camera-dependent 
variation in PCE values, the obtained average PCE values are normalized with respect 
to the PCE value obtained for a fixed QP value. The resulting normalized average 
PCE values are further averaged across all videos to obtain a camera-independent 
relationship between PCE and QP values. Finally, the obtained Q P-PCE relation- 
ship is converted into a weighting function by taking into account the formulation 
of the PCE metric as shown in Fig. 5.3 (red curve). Correspondingly, the frame level 
mask M in Eq. (5.3) is obtained by filling all pixel locations corresponding to a block 
with a particular Q P value with the respective values on this curve. 

Quantization-Based PRNU Weighting Without Skipped Blocks: Another weight- 
ing scheme that follows the previous two schemes is the quantization-based weighting 
of non-skipped blocks. Repeating the same evaluation process for quantization-based 
PRNU weighting while excluding skipped blocks, the weighting function given in 
Fig. 5.3 (blue curve) is obtained (Altinisik et al. 2021). As can be seen, both curves 
exhibit similar characteristics, although in this case, the weights vary over a smaller 
range. This indicates that the variation in the strength of the PRNU estimated across 
all coded blocks is less variable. This is mainly because eliminating skipped blocks 
result in higher PCE values at higher Q P values where compression is more severe 
and skipped blocks are more frequent. Therefore, when the resulting Q P-PCE values 
are normalized with respect to a fixed PCE value, it yields a more compact weight 
function. This new weight function is used in a similar way when creating a mask 
for each video picture with one difference that for skipped blocks the corresponding 
mask values are set to zero regardless of the Q P value of the block. 

Coding Rate (\R)-Based PRNU Weighting: Although the distortion introduced 
to a block can be inferred from its type or through the strength of quantization it has 
undergone (QP), the correct evaluation of the reliability of the PRNU ultimately 
requires incorporating the RDO algorithm. Since the terms D and J, in Eq. (5.1), are 
unknown at the decoder, Altinisik et al. (2021) considered using AR which can be 
determined using the number of bits spent for coding a block and the block’s QP. 
Similar to previous schemes, the relationship between AR and PCE is empirically 
determined by re-encoding the raw-quality videos at 42 different bitrates varying 
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between 200 Kbps and 32 Mbps and obtaining a camera-independent weighting 
function as displayed in Fig. 5.4 to create a mask for weighting PRNU blocks. 
Comparison: Tests performed on a set of videos coded at different bitrates showed 
that block-based PRNU weighting schemes yield a better estimate of the PRNU both 
in terms of the average improvement in PCE values and in the number of videos 
that yield a PCE value above 60 when compared to the conventional method, even 
after performing loop filter compensation (Altinisik et al. 2021). It can be observed 
that the improvement in PCE values due to PRNU weighting becomes more visible 
at bitrates higher than 900 Kbps as shown in Fig.5.5. Overall, at all bitrates, the 
rate associated with a block is determined to be a more reliable estimator for the 
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strength of extracted PRNU pattern. Similarly, the performance of the Q P-based 
weighting scheme consistently improves when skipped blocks are eliminated from 
the estimation process. 

It was also found that at very high compression rates (at coding rates less than 
0.024 bits per pixel per frame), elimination of skipped blocks performs comparably 
to a coding rate-based weighting scheme. This finding essentially indicates that at 
high compression rates, PRNU patterns extracted from coded blocks are all equally 
weakened, and a block-based weighting approach does not provide an additional 
gain. Tests also showed that the improvement in the performance due to block- 
based PRNU weighting does not come at the expense of an increased number of 
false-positive matches as demonstrated by the resulting ROC curves presented in 
Fig.5.6. The gap between the two curves that correspond to the conventional method 
(tailored for photographic images) and the use of rate-based PRNU weighting scheme 
also reveals the importance of incorporating video coding parameters into PRNU 
estimation. 


5.5 Tackling Digital Stabilization 


Most cameras deploy digital stabilization while recording a video. With digital sta- 
bilization, frames captured by the sensor are moved and warped to align with one 
another through in-camera processing. Stabilization can also be applied externally 
with greater freedom in processing on a computer using one of several video edit- 
ing tools or in the cloud, such as performed by YouTube. In all cases, however, the 
application of such frame level processing introduces asynchronicity among PRNU 
patterns of consecutive frames of a video. 

Attribution of digitally stabilized videos, thus, requires an understanding of the 
specifics of how stabilization is performed. This, however, is a challenging task as 
the inner workings and technical details of processing steps of camera pipelines are 
usually not revealed. Nevertheless, at a high level, the three main steps of digital 
stabilization involve camera motion estimation, motion smoothing, and alignment 
of video frames according to the corrected camera motion. Motion estimation is per- 


106 H. T. Sencar 


formed either by describing the geometric relationship between consecutive frames 
through a parametric model or through tracking key feature points across frames to 
obtain feature traJectories (Xu et al. 2012; Grundmann et al. 2011). With sensor-rich 
devices such as smartphones and tablets, data from motion sensors are also utilized 
to improve the estimation accuracy (Thivent et al. 2018). This is followed by the 
application of a smoothing operation to estimate camera motion or the obtained fea- 
ture trajectories to eliminate the unwanted motion. Finally, each frame is warped 
according to the smoothed motion parameters to generate the stabilized video. 

The most critical factor in stabilization depends on whether the camera motion 
is represented by a two-dimensional (2D) or three-dimensional (3D) model. Early 
methods mainly relied on the 2D motion model that involves application of full- 
frame 2D transformations, such as affine or projective models, to each frame during 
stabilization. Although this motion model is effective for scenes far away from the 
camera where parallax is not a concern, it cannot be generalized to more complicated 
scenes captured under spatially variant camera motion. To overcome 2D modeling 
limitations, more sophisticated methods have considered 3D motion models. How- 
ever, due to difficulties in 3D reconstruction, which requires depth information, these 
methods introduce simplifications to 3D structure and rely heavily on the accuracy 
of feature tracking (Liu et al. 2009, 2014; Kopf 2016; Wang et al. 2018). Most crit- 
ically, these methods involve the application of spatially variant warping to video 
frames in a way that preserves the content from distortions introduced by such local 
transformations. 

In essence, digital image stabilization tries to align content in successive pictures 
through geometric registration, and the details of how this is performed depend on the 
complexity of camera motion during capture. This may vary from the application of a 
simple Euclidean transformation (scale, rotation, and shift applied individually or in 
combination) to spatially varying warping transformation in order to compensate for 
any type of perspective distortion. As these transformations are applied on a per-frame 
basis and the variance of camera motion is high enough to easily remove pixel-to- 
pixel correspondences among frames, alignment or averaging of frame level PRNU 
patterns will not be very effective in estimating a reference PRNU pattern. Therefore, 
performing source attribution in a stabilized video requires blindly determining and 
inverting those transformations applied to each frame. 

At the simplest level, an affine motion model can be assumed. However, the 
deficiency of affine transformation-based stabilization solutions have long been a 
motivation for research in the field of image stabilization. Furthermore, real-time 
implementation of stabilization methods has always been a concern. With the con- 
tinuous development of smartphones and each year’s models packing more comput- 
ing power, it is hard to make an authoritative statement on the state of stabilization 
algorithms. In order to develop a better understanding, we performed a test using the 
Apple iMovie video editing tool that runs on Mac OS computers and iOS mobile 
devices. With this objective, we shot a video by panning the camera around a still 
indoors scene while stabilization and electronic zoom were turned off. The video 
was then stabilized by iMovie at 10% stabilization setting which determines the 
maximum amount of cropping that can be applied to each frame during alignment. 
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Fig. 5.7 KLT feature point displacements in two sample frames in a video stabilized using the 
iMovie editing tool at 0.1 stabilization setting. Displacements are measured in reference to stabilized 
frames 


To evaluate the nature of warping applied to video frames, we extracted Kanade- 
Lucas—Tomasi (KLT) reference feature points, which are frequently used to estimate 
the motion of key points (Tomasi and Kanade 1991). Then, displacements of KLT 
points in pre- and post-stabilized video frames were measured. Figure 5.7 shows the 
optical flows estimated using KLT points for two sample frames. It can be seen that 
KLT points in a given locality move similarly, mostly inwards due to cropping and 
scaling. However, a single global transformation that will cover the movement of all 
points seems unlikely. In fact, our attempts to determine a single warp transforma- 
tion to map key points in successive frames failed with only 4—5, out of the typical 
50, points resulting in a match. This finding overall supports the assumption that 
digital image stabilization solutions available in today’s cameras may be deploying 
sophisticated methods that go beyond frame level affine transformations. 

To take into account more complex stabilization transformations, projective trans- 
formations can be utilized. A projective transformation can be represented by a3 x 3 
matrix with 8 free parameters that specify the amount of rotation, scaling, translation, 
and projection applied to a point in a two-dimensional space. In this sense, affine 
transformations form a subset of all such transformations without the two-parameter 
projection vector. Since a transformation is performed by multiplying the coordinate 
vector of a point with the transformation matrix, its inversion requires determination 
of all these parameters. This problem is further exacerbated when transformations 
are applied in a spatially variant manner. Since in that case different parts of a pic- 
ture will have undergone different transformations, transform inversion needs to be 
performed at the block level where encountered PCE values will be much lower. 


5.5.1 Inverting Frame Level Stabilization Transformations 


Attribution for images subjected to geometric transformations has been previously 
studied earlier in a broader context. Considering scaled and cropped photographic 
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images, Goljan et al. (2008) proposed a brute-force search for the geometric trans- 
form parameters to re-establish alignment between PRNU patterns. To be able to 
achieve this, the PRNU pattern obtained from the image in question is upsampled 
in discrete steps and matched with the reference PRNU at all shifts. The values that 
yield the highest PCE are identified as the correct scaling factor and the cropping 
position. More relevantly, by focusing on panoramic images, Karakucuk et al. (2015) 
investigated source attribution under more complex geometric transformations. Their 
work showed the feasibility of estimating inverse transform parameters considering 
projective transformations. 

In the case of stabilized videos, Taspinar et al. (2016) proposed determining the 
presence of stabilization in a video by extracting reference PRNU patterns from the 
beginning and end of a video and by testing the match of the two patterns. If stabiliza- 
tion is detected, one of the I frames is designated as a reference and it is attempted to 
align other I frames to it through a search of inverse affine transformations to correct 
for the applied shift and rotation. The pattern obtained from the aligned I frames is 
then matched with a reference PRNU pattern obtained from a non-stabilized video 
by performing another search. 

Iuliani et al. (2019) introduced another source verification method similar to 
Taspinar et al. (2016) by also considering mismatches in resolution, thereby effec- 
tively identifying stabilization and downsizing transformations in a combined man- 
ner. To perform source verification, 5—10 I frames are extracted and corresponding 
PRNU patterns are aligned with the reference PRNU pattern by searching for the 
correct amount of scale, shift, and cropping applied to each frame. Those frames 
that yield a matching statistic above some predetermined PCE value are combined 
together to create an aligned PRNU pattern. Tests performed on videos captured by 8 
cameras in the VISION dataset using the first fiveframes of each video revealed that 
86% of videos captured by cameras that support stabilization in the reduced dataset 
can be correctly attributed to their source with no false positives. This method was 
also shown to be effective on a subset of videos downloaded from YouTube with an 
overall accuracy of 87.3%. 

To extract a reference PRNU pattern from a stabilized video, Mandelli et al. (2020) 
presented a method considering stabilization transformations in the form of affine 
transformations. In this approach, PRNU estimates obtained from each frame are 
matched with other frames in a pair-wise manner to identify those translated with 
respect to each other. Then the largest group of frames that yield a sufficient match are 
combined together to obtain an interim reference PRNU pattern, and the remaining 
frames are aligned with respect to this pattern. Alternatively, when the reference 
PRNU pattern at a different resolution is already known, then it is used as a reference 
and PRNU patterns of all other frames are matched by searching for transformation 
parameters. For verification, five I frames extracted from the stabilized video are 
matched to this reference PRNU pattern. If the resulting PCE values for at least one 
of the frames is observed to be higher than the threshold, a match is assumed to 
be achieved. The results obtained on the VISION dataset show that the method is 
effective in successfully attributing 71% and 77% of videos captured by cameras 
that support stabilization with 1% false-positive rate, respectively, when 5 and 10 
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I frames are used. Alternatively, if the reference PRNU pattern is extracted from 
photos, rather than flat videos, under the same conditions, attribution rates are found 
to increase to 87% for 5 frames and to 91% for 10 frames. 

A note concerning the use of videos in the VISION dataset is the lack of labeling for 
videos. The above-mentioned studies (i.e., Iuliani et al. 2019 and Mandelli et al. 2020) 
considered all videos captured by cameras that support stabilization as stabilized 
and reported performance results accordingly. However, the number of weakly and 
strongly stabilized videos in the VISION dataset is smaller than the overall number 
of 257 videos captured by such cameras. In fact, it is demonstrated in Altinisik and 
Sencar (2021) that 105 of those videos are not stabilized or very lightly stabilized 
(as they can be attributed using the conventional method after compensation of the 
loop filtering), 108 are stabilized using frame level affine transformations, and the 
remaining 44 are more strongly stabilized. Hence, by including the 105 videos in their 
evaluation, these approaches inadvertently yielded higher attribution performance for 
stabilized videos. 


5.5.2 Inverting Spatially Variant Stabilization 
Transformations 


This form of stabilization may cause different parts of a frame to undergo different 
warpings. Therefore, tackling it requires inverting the stabilization transformation 
while being restricted to operate at sub-frame level. To address this problem, Altinisik 
et al. (2021) proposed a source verification method that allows an efficient search of a 
large range of projective transformations by operating at the block level. Considering 
different block sizes, they determined that blocks of size less than 500 x 500 pixels 
yield extremely low PCE values and cannot be used to reliably identify transformation 
parameters. Therefore, in their search, a block of a video frame is inverse transformed 
repetitively and the transformation that yields the highest PCE between the inverse 
transformed block and the reference PRNU pattern is identified. 

During the search, rather than changing transformation parameters blindly, which 
may lead to unlikely transformations and necessitate interpolation to convert non- 
integer coordinates to integer values, only transformations that move corner vertices 
of the block within a search window are considered. It is assumed that coordinates 
of each corner of a selected block may move independently within a window of 
15 x 15 pixels, i.e., spanning a range of +7 pixels in both coordinates in respect of 
the original position. Considering the possibility that a block might have also been 
subjected to a translation (e.g., introduced by a cropping) not contained within the 
searched space of transformations, each inverse transformed block is also searched 
within a shift range of +50 pixels in all directions in the reference pattern. 

To accelerate the search, instead of performing a pure random search over all 
transformation space, the method adopts a three-level hierarchical grid search. With 
this approach, the search space over rotation, scale, and projection is coarsely sam- 
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pled. In the first level, each corner coordinate is moved by +4 pixels (in all directions) 
over a coarse grid to identify five transformations (out of 9* possibilities) that yield 
the highest PCE values. A higher resolution search is then performed by the same 
process over neighboring areas of the identified transformations on a finer grid by 
changing the corner coordinates of transformed blocks +2 and, again, retaining only 
the five transformations producing the five highest values. Finally, in the third level, 
coarse transformations determined in the previous level are further refined by consid- 
ering all neighboring pixel coordinates (around a +1 range) to identify the most likely 
transformations needed for inverting the warping transformation due to stabilization. 
This overall reduced the number of transformations from 158 to 11 x 94 possibilities. 
A pictorial depiction of the grid partitioning of the transformation space is shown 
in Fig.5.8. The search complexity is further lowered by introducing a two-level grid 
search approach that provides an order of magnitude decrease in computation, from 
11 x 94 to 11 x 93 transformations, without a significant change in the matching 
performance. 

The major concern with increasing the degree of freedom when inverting stabiliza- 
tion transformation is the large number of transformations that must be considered to 
correctly identify each transformation. Essentially, this causes an increase in false- 
positive matches, as multiple inverse transformed versions of a PRNU block (or 
frame) are generated and matched with the reference pattern. Therefore, there is a 
need for measures to eliminate spurious matches arising due to incorrect transfor- 
mations. With this objective, (Altinisik and Sencar 2021) also imposed block and 
sub-block level constraints to validate the correctness of identified inverse transfor- 
mations and demonstrated their effectiveness. 

The results of this approach are used to verify the source of stabilized videos in 
three datasets. Results showed that the introduced method is able to correctly attribute 
23-30% of the stabilized videos in the three datasets that cannot be attributed using 
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Table 5.2 Characteristics of datasets to study stabilization 


Dataset # of Cameras # of unstabilized | # of weakly # of strongly 
videos stabilized videos | stabilized videos 


VISION 
iPhone SE-XR 
APS 


earlier proposed approaches (that assume an affine motion model for stabilization), 
without any false attributions while utilizing only 10 frames from each video. 


5.6 Datasets 


There is a lack of large-scale datasets to enable test and development of video source 
attribution methods. This is mainly because there are very few resources on the 
Internet that provide original videos compared to photographic images. In almost 
all public video sharing sites, video data are re-encoded to reduce the storage and 
transmission requirements. Furthermore, video metadata typically exclude source 
camera information, adding another layer of complication to data collection from 
open sources. 

The largest available dataset is the VISION dataset, which includes a collection 
of photos and videos captured by 35 different camera models (Shullani et al. 2017). 
However, this dataset is, however, not comprehensive enough to cover the diver- 
sity of compression and stabilization behavior of available cameras. To tackle the 
compression problem, (Altinisik et al. 2021; Tandogan et al. 2019) utilized a set 
of raw-quality videos captured by 20 smartphone cameras through a custom cam- 
era app under controlled conditions. These videos were then externally compressed 
under different settings.* To study the effects of stabilization, (Altinisik and Sencar 
2021) introduced two other datasets in addition to the VISION dataset. One of these 
includes media captured by two iPhone camera models (the iPhone SE-XR dataset) 
and the other includes a set of externally stabilized videos using the Adobe Premiere 
Pro video processing tool (the APS dataset). Table 5.2 provides the stabilization 
characteristics of the videos in those datasets. 


3 These videos are available at https://github.com/VideoPRNUExtractor. 
4 The iPhone SE-XR and the APS datasets are also available at the above link. 
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5.7 Conclusions and Outlook 


The generation of a video involves several processing steps that are growing in both 
sophistication and number. These processing steps are known to be very disruptive 
to PRNU estimation, and there is little public information about their specifics. 
Therefore, developing a PRNU estimation method for videos that is as accurate 
and efficient as it is for photographic images is an extremely challenging task. 

Current research focusing on this problem has mainly attempted to mitigate down- 
sizing, stabilization, and video coding operations. Overall, the reported results of 
these studies have shown that it is difficult to generate a reference PRNU pattern 
from a video, specifically for stabilized and low-to-medium compressed videos. 
Therefore, almost all approaches to attribution of stabilized videos assume a source 
verification setting where a given video is matched against a known camera. That is, 
a reference PRNU pattern of the sensor is assumed to be available. 

In a source verification setting, with the increase in the number of frames in a 
video, the likelihood of correctly attributing a video also increases. However, this 
does not remove the need to devise a PRNU estimation method. To emphasize this, 
we tested how the conventional PRNU estimation method (developed for photos, 
without the Wiener filtering component) performs when the source of all frames 
comprising strongly stabilized videos in the three public datasets mentioned above 
is verified using the corresponding camera reference PRNU patterns. These results 
showed that out of more than 80K frames tested, only two frames yielded a PCE value 
above the commonly accepted threshold of 60 (78 and 491) while a great majority 
yielded PCE values less than 10. 

Ultimately, the reliable attribution of videos depends on the severity of information 
loss mainly caused by video coding and the ability to invert geometric transformations 
mostly stemming from stabilization. For highly compressed videos, with bitrates 
lower than 900 kbps, the PRNU is significantly weakened. Similarly, the use of 
smaller blocks or low-resolution frames, with sizes below 500 x 500 pixels, do not 
yield reliable PRNU patterns. Thus, PCE values corresponding to true and false 
matching cases largely overlap. 

Nonetheless, the most challenging aspect of source attribution remains stabiliza- 
tion. Since most modern smartphone cameras activate stabilization when the camera 
is set to video mode, before video recording starts, its disruptive effects should be 
considered to be always present. Overall, coping with stabilization induces a signif- 
icant computational overhead to PRNU estimation which is mainly determined by 
the assumed degree of freedom in the search for applied transformation parameters. 
More critically, this warping inversion step amplifies the number of false attributions, 
which forces the use of higher threshold values for correct matching, thereby further 
reducing the accuracy. This underscores the need to develop methods for an effective 
search of transformation space to determine the correct transformation and trans- 
formation validation mechanisms to eliminate false-positive matches as a research 
priority. 
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Overall, the PRNU-based source attribution of photos and videos is an important 
capability in the arsenal of media forensics tools. Despite significant work in this 
field, reliable attribution of a video to its source camera remains a clear problem. 
Therefore, further research is needed to increase the applicability of this capability 
into practice. 
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Chapter 6 A) 
Camera Identification at Large Scale get 


Samet Taspinar and Nasir Memon 


Chapters 3 and 4 have shown that PRNU is a very effective solution for source 
camera verification, i.e., linking a visual object to its source camera (Lukas et al. 
2006). However, in order to determine the source camera, there has to be a suspect 
camera in the possession of a forensics analyst which is often not the case. On the 
other hand, in the source camera identification problem, the goal is to match a PRNU 
fingerprint to a large database. This capability is needed when one needs to attribute 
one or more images from an unknown camera to a large number of images in a 
large image repository to find other images that may have been taken from the same 
camera. For example, consider that a legal entity acquires some illegal visual objects 
(such as child pornography) without any suspect camera available. Consider also that 
the owner of the camera shares images or videos on social media such as Facebook, 
Flickr, or YouTube. In such a scenario, it becomes crucial to find a way to link the 
illegal content to those social media accounts. Given that these social media contains 
billions of accounts, camera identification at large scale becomes a very challenging 
task. 

This chapter will focus on how databases that support camera identification can 
be created and the methodologies to search query images on those databases. Along 
with that, the time complexities, strengths, and drawbacks of different methods will 
also be presented in this chapter. 


S. Taspinar 
Overjet, Boston, MA, USA 
e-mail: samet @ overjet.ai 


N. Memon (BX) 
New York University Tandon School of Engineering, New York, NY, USA 
e-mail: memon @nyu.edu 


© The Author(s) 2022 117 
H. T. Sencar et al. (eds.), Multimedia Forensics, Advances in Computer Vision and Pattern 
Recognition, https://doi.org/10.1007/978-981-16-7621-5_6 


118 S. Taspinar and N. Memon 


6.1 Introduction 


As established in Chaps. 3 and 4, once a fingerprint, K, is obtained from a camera, 
the PRNU noise obtained from other query visual objects can be tested against the 
fingerprint to determine if they are captured by the same camera. Each of these 
comparisons is a verification task (i.e., 1-to-1). 

On the other hand, in the identification task, the source of a query image or video is 
searched within a fingerprint collection. The collection could be compiled from social 
media such as Facebook, Flickr, and YouTube. Visual objects from each camera in 
such a collection can be clustered together, and their fingerprints extracted to create 
a known camera fingerprint database. This chapter will discuss identification on a 
large scale which can be seen as a sequence of multiple verification tasks (i.e., 1-to- 
many). Specifically, we will focus on structuring large databases of images or camera 
fingerprints and the search efficiency of different methods using such databases. 

Notice that search can mainly be done in two ways: querying one or more images 
from a camera against a database of camera fingerprints or querying a fingerprint 
against a database of images. This chapter will focus on the first case where we will 
assume there is only a single query image whose source camera is under question. 
However, the workflow is the same for the latter case as well. 

Over the past decade, various approaches have been proposed to speed up camera 
identification using large databases of camera fingerprints. These methods can be 
grouped under two categories: (i) techniques for decreasing the cost of pairwise 
comparisons and (ii) decreasing number of comparisons made. Along with these a 
third category aims at combining the strengths of two approaches to create a superior 
method. 

We will assume that neither query images nor the images used for computing 
the fingerprint are geometrically transformed unless otherwise stated. As described 
in Chap.3, some techniques can be used to reverse the geometric transformations 
performed before searching the database. However, not all algorithms proposed in 
this chapter will work, and geometric transformations may cause a failure in the 
methodology. We will also assume that there is no camera brand/model information 
available as metadata typically easy to forge and often does not exist when visual 
objects are obtained from social media. However, when reliable metadata is available, 
it can speed up search further by restricting to the relevant models in the obvious 
manner. 


6.22 Naive Methods 


This section will present the most basic methods for source camera identification. 
Consider that images or videos from N known cameras are obtained. Their fin- 
gerprints are extracted as a one-time operation and stored in a fingerprint dataset, 
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D = Ki, K2, ... KN to form a fingerprint dataset. This step is common for all the 
methods presented in this and other camera identification methods. 


6.2.1 Linear Search 


The most straightforward search method for source camera identification is linear 
search (brute force). A query visual object with n pixels, whose fingerprint estimate 
is K4, is then compared with all the fingerprints until (i) it matches with one of them 
(i.e., at least one H, case) (ii) no more fingerprints left (i.e., all comparisons are Ho 
cases) 


Ho : K; £ K, (non-matching fingerprints) (6.1) 
H. : K; = K, (matching fingerprints) : 

So the complexity of the linear search is O(nN) for a single query image. Since 
modern cameras typically have more than 10M pixels, and the number of cameras 
in a comprehensive database could be many millions, the computational cost of this 
method becomes exceedingly large. 


6.2.2 Sequential Trimming 


Another simple method is to trim all images to a fixed length such that all fingerprints 
become the same length k where k < n (Goljan et al. 2010). The main benefits of 
this method are it would speed up the search by n/k times, and the memory and disk 
usage drop by the same ratio. 

Given a camera fingerprint, K;, and a fingerprint of a query object, K,, their cor- 
relation c4 is independent of the number of remaining pixels after trimming (i.e., k). 
However, as k drops, the probability of false alarm, Pr, increases, and the proba- 
bility of detection, Pp decreases for a fixed decision threshold (Fridrich and Goljan 
2010). Therefore, the sequential trimming method may have only limited speed up 
if the goal is to retain performance. 

Moreover, this method still requires the query fingerprint to be correlated with N 
fingerprints in the database (i.e., linear with the number of cameras in the database). 
However, the time the computation complexity drops to O(kN) as fingerprint is 
trimmed from n to k elements. 
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6.33 Efficient Pairwise Correlation 


As presented in Chap.3, the complexity of a single correlation is proportional to 
the number of pixels in a fingerprint pair. Since modern cameras contain millions 
of pixels and a large database can contain billions of images, using naive search 
methods is prohibitively expensive for source camera identification. 

The previous section shows that the time complexity of a brute-force search is 
proportional to the number of fingerprints and number of pixels in each fingerprint 
(i.e., O(n x N)). Similarly, sequential trimming can only speed up the search a few 
times. Hence, further improvements for identification are required. 

This section presents the first approach mentioned in Sect.6.1 that decreases the 
time complexity of a pairwise correlation. 


6.3.1 Search over Fingerprint Digests 


The first method that addresses the above problem is searching over fingerprint digests 
(Goljan et al. 2010). This method reduces the dimensionality of each fingerprint by 
selecting the k largest (in terms of magnitude) fingerprint elements along with their 
spatial location from a fingerprint F. 


Fingerprint Digest 


Given a fingerprint, F, with n fingerprint elements, the proposed method picks the top 
k largest of them as well as their indices. Suppose that the values of these k elements 
are Vr = {v1, v2, ... uk) and their locations are Lr = {lh, h, ...l} for 1 < l, <n 
where Vr = F[L +]. 

The digest of each fingerprint can be extracted and stored along with the original 
fingerprint. An important aspect is determining the number of pixels, k, in a digest. 
Picking a large number of pixels will not yield a high speedup, whereas a low number 
will result in many false positives. 


Search step 


Suppose that the fingerprint of the query image is X and it is correlated with one of 
the digests in the database Vr whose indices are Lpr. In the first step, the digest of X 
is estimated as 


Vy = X[L +] (6.2) 


Then, corr(Vr, Vx) is obtained. If the correlation is lower than a preset thresh- 
old, 7, then X and F are said to be from different sources. Otherwise, the original 
fingerprint X and F are correlated to decide if they originate from the same source 
camera. Overall, this method speeds up approx. k/n times where k << n. Therefore, 
a significant speedup can be achieved. 
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6.3.2 Pixel Quantization 


The previous section presented that search in large databases is proportional to the 
number of images in a database, N and the number of elements each fingerprint 
has, n. To be more precise, together with these variables, the number of bits of 
each fingerprint element also plays a role in the time complexity of the search. 
Taking the number of bits into account, the time complexity of a brute-force search 
becomes O(b x n x N) where b is the number of bits of each element. Therefore, 
another avenue for speeding up pairwise fingerprint comparison is by quantizing the 
fingerprint elements. 

Bayram et al. (2012) proposed a fingerprint binarization method where each fin- 
gerprint element is represented only by its sign, and the magnitude is discarded. 
Hence, each element is represented by either 0 or 1. Given a camera fingerprint, X, 
with n elements, its binary form, XP, can be represented as 


1, X i >= (0) 
where X B is the ith element in binary form for i € {1,... n}. 

This way, the efficiency of pairwise fingerprint matching can be improved by a 
factor of b. The original work (Lukas et al. 2006) uses 64-bit double precision for 
each fingerprint element. Therefore, the binary quantization method can save storage 
usage and computation efficiency by a factor 64 as compared to Lukas et al. (2006). 
However, binary quantization comes with an expense of a decrease in the correct 
detection. 

Alternatively, one can use the quantization approach but with more bits instead 
of a single bit. This way, the metrics can be preserved. When the bit depth of two 
fingerprints is reduced from 64-bit to 8-bit format, the pairwise correlation changes 
marginally (i.e., less than 1074) but nearly 8 times speed up can be achieved as the 
complexity of Pearson correlation is proportional to the number of bits. 


6.3.3 Downsizing 


Another simple yet effective method is creating multiple downsized versions of a 
fingerprint where along with the original fingerprint L other copies of the fingerprint 
are stored (Yaqub et al. 2018; Tandogan et al. 2019). In these methods, a camera 
fingerprint whose resolution is r x c, is downsized by a scaling factor s in the first 
level. Then, in the next level, the scaled fingerprint is further resized by s, and so on. 
Hence, the resolution of a fingerprint in Lth level the resolution becomes x —. 
Since, multiple copies of a fingerprint is stored in this method, the storage usage 
increases by af s9 The authors show that the optimal value for s is 2 as it causes 
the least interpolation artefact, and L is 3 level. Moreover, Lanczos—3 interpolation 
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results in best accuracy as it best approximates the Sinc kernel (Lehman et al. 1999). 
In these settings, the storage requirement increases by ~33%. 


Matching 


Given a query fingerprint and a dataset of N fingerprints (for each one of them 3 
other resized versions are stored), a pairwise comparison can be done as follows: 

The query fingerprint is resized to L3 size (i.e., by 1/8) and compared by L3 
version of ith fingerprint. If the comparison of L3 query and reference fingerprint 
pair is below a preset threshold, the query image is not captured by the ¿th camera, and 
the query fingerprint is then compared by (i + 1)th reference fingerprint. Otherwise, 
L2 fingerprint are compared. The same process is continued for other levels as well. 
The preset PCE threshold is set to 60, as presented in Goljan et al. (2009). 

The results show that when cropping is not involved, this method can achieve 
up to 53 times speed up with a small drop in the true positive rate. 


6.3.4 Dimension Reduction Using PCA and LDA 


It is known that sensor noise is high-dimensional data (i.e., more than millions of pix- 
els). This data contains redundant and interfering components such as demosaicing, 
JPEG compression, and other in-camera image processing operations. Because of 
these artifacts are the probability of match may decrease, and a significant slow down 
happens. Therefore, removing those redundant or interfering signals helps improve 
the matching accuracy and the efficiency of pairwise comparison. 

Li et al. propose to use Principal Component Analysis (PCA) and Linear Discrim- 
inant Analysis (LDA) to compress camera fingerprints (Li et al. 2018). The proposed 
method first uses PCA-based denoising (Zhang et al. 2009) to create a better repre- 
sentation of camera fingerprints. Then LDA is utilized to decrease the dimensionality 
further as well as further improve matching accuracy. 


Fingerprint Extraction 


Suppose that there are n images, [1 , . . . In}, captured by a camera. These images are 
cropped from their centers such that each becomes N x N resolution. First, PRNU 
noise, x;, extracted from these images using one of extraction methods (Lukas et al. 
2006; Dabov et al. 2009). In this step, the noise signals are flattened to create column 
vectors. Hence, the size of x; becomes N* x 1. They are then horizontally stacked 
to created the training matrix, X as follows: 


X = [x1, X2, ... Xn] (6.4) 
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Feature extraction via PCA 


The mean of X, u, becomes 
1 n 
H= a 2 Xi 
i= 


The training set is normalized by subtracting the mean from each image by x; = 
x; — u. Hence, the normalized training set X becomes 


X =. [X 1, X2, Mae Xn] (6.5) 


The covariance matrix, © of X can be calculated as © = E[X XT] ~ 1X Xt: 

PCA is performed for finding orthonormal vectors, u; and their eigenvalues, Az. 
The dimensionality of X is very high (i.e., N? x n). Hence the computational com- 
plexity of PCA is prohibitively high. Instead of direct computation, singular value 
decomposition (SVD) is applied, which is a close approximation of PCA. SVD can 
be written as 


E = GAO! (6.6) 


where ® = [¢1, %2 . . . m] is the m x m orthonormal eigenvector matrix and A = 
diag[Ài, A2... An} are eigenvalues. 

Along with creating decorrelated eigenvectors, PCA allow to reduce dimensional- 
ity. Using the first d of m eigenvectors, transformation matrix M pca = {1, %2... Qa} 
for d < m. Then the transformed dataset, P becomes Ŷ = MI X whose size is 
d x n. Generally speaking, most of the discriminative information of the SPN signal 
will concentrate on the first few eigenvectors. In this research, the authors kept only 
d eigenvalues to preserve 99% of the variance, which is used to create a compact 
version of the PRNU noise. Moreover, thanks to this dimension reduction, the “con- 
tamination” caused by other in-camera operations can be decreased. The authors call 
this a “PCA feature”. 


Feature extraction via LDA 


LDA is used for two purposes: (i) dimension reduction (ii) increasing the distance 
between different classes. Since the training set is already labeled, LDA can be used 
to reduce dimension and increase separability of different classes. This can be done 
by finding a transformation matrix, Mida that maximizes the ratio of the determinant 
of between-class scatter matrix, Sp to within-classes scatter matrix Sw: 


== — | JTS 
Mida = argmax = pe (6.7) 


where Sy = A eae — u;)(yi — pj)". There are c distinct classes and jth 
class has L; samples, y; is the ith sample of the class j and u; is the mean of the 
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class j. On the other hand Sp is defined as Sp = + > y (u; — p)(uj — Ww)" 
where u is the average of all samples. 

Using the “LDA feature” extractor, Mida, a c — 1 dimensional vector can be 
obtained as 


z = Mj[(M4.4)" X] 


pca 
= (Mea) (Miia) X 


(6.8) 

So, PCA and LDA operations can be combined using only a single operation, 
i.e., Me = (M40) (Mita) Typically (c — 1) is lower than d, which can further 
compress the training data. 


Reference Fingerprint Estimation 


Consider that jth camera, c;, which has captured image I), In, ... I L; and their PRNU 
noise patterns are x), X2, . . . Xz, respectively. 

The reference fingerprint for c; can be obtained using (6.8). So, using (6.8), LDA 
features are extracted (i.e., z; = M. T X). 


Source Matching 


The authors present two different methodologies: Algorithm 1 and 2. These algo- 
rithms make use of y! (PCA-SPN) and z (LDA-SPN), respectively. Both of these 
compressed fingerprints have much lower dimensionality compared to the original 
fingerprint, x. 

The source camera identification process is straightforward for both algorithms. 
“One-time” fingerprint compression step is used to set up the databases, the PCA-SPN 
of query fingerprint, ya , is correlated by each fingerprint in the database. Similarly, for 
LDA-SPN, zç is correlated with fingerprint in the more compressed dataset. Decision 
thresholds 7, and 7, can be setup based on desired false acceptance rate (Fridrich 
and Goljan 2010). 


6.3.5 PRNU Compression via Random Projection 


Valsesia et al. (2015a,b) utilized the idea of applying random projections (RP) fol- 
lowed by binary quantization to reduce fingerprint dimension for camera identifica- 
tion on large databases. 


Dimension Reduction 


Given a dataset, D which contains N n-dimensional fingerprints, the goal is to reduce 
each fingerprint to m-dimensions with slight or no information loss where m << n. 
RP is a method to reduce original fingerprints using a random matrix ® € R”*” 
so that the original dataset, D € R"*% can be compressed to a lower dimensional 
subspace, A € RmxN as follows: 
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A= op (6.9) 


Johnson et al. present that given an n dimensional data, there exists an m dimen- 
sional Euclidean space via a transformation that retains all pairwise distances within 
a factor of 1 + e (Lindenstrauss 1984). 

A linear mapping, f, can be represented by a random matrix ® € R” >". The ele- 
ments of ® can be drawn from certain probability distributions such as Gaussian or 
Rademacher (Achlioptas 2003). However, one of the main drawbacks of generating 
m x n random numbers. Given that n is on the order of several million, generat- 
ing and storing such a random matrix is not feasible as it would require too much 
memory. Moreover, full matrix multiplication is carried out for the dimension reduc- 
tion of each fingerprint. This problem can be overcome by using circulant matrices. 
Instead of generating an entire m x n matrix, only the first row is generated in Gaus- 
sian distribution, and the rest of the matrix can be obtained by circular shift. The 
multiplication after generating ® can be done using Fast Fourier Transform (FFT). 
Thanks to FFT, the dimension reduction can be efficiently done with a complexity 
of O(N x n x logn) as opposed to O(N x m x n). 

Further improvement can be achieved by binarization. Instead of a floating value, 
each element can be represented by a single bit as 


A = sign(®D) (6.10) 


The compressed dataset A is now a matrix consisting of N vectors, each of which 
is a separate fingerprint, K; as 


A = {yi : yi = sign(®K;)},i = {1,..., N} (6.11) 


Search 


Now that database of N fingerprints is set up, the authors present two methods to 
identify the source of a query fingerprint: linear search and hierarchical search. In 
these search methods, a query fingerprint, Ê € R" is first transformed in similar 
manner as the database (i.e., random projection followed by binary quantization) as 
follows: 


$ = sign(®R) (6.12) 


Since the query fingerprint and the N fingerprints in the dataset are compressed to 
their binary form, pairwise correlation can be done using Hamming distance instead 
of Pearson correlation coefficient. 


Linear Search 


The most straightforward approach for search is by brute-force approach. In this 
method, the compressed query fingerprint, $ is compared against each fingerprint 
in A, y;. Therefore, similar to the search over fingerprint digests (Sect. 6.3.1), the 
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computational complexity of the method is proportional to the number of fingerprints, 
N (i.e., O(N)). Moreover, this method also provides the benefit achieved by the 
binary quantization (Sect. 6.3.2). However, it is crucial to note that the number of 
pixels used for a fingerprint digest is significantly lower than the random projection. 
While fingerprint digests typically require up to a few 10K pixels, this method 
typically requires 0.5 M pixels. 


Hierarchical Search 


A first step improvement for linear search is creating a hierarchical search schema 
where two different sets of compressed datasets are created. In the first step a coarse 
version of the compressed fingerprints mi elements (mı << m) correlated with a 
query fingerprint. This step works as a pre-filter that each coarse query fingerprint 
is correlated with the coarse versions of the N fingerprints. This step reduces the 
search space to N’ (N’ << N). In the second step, the m random projections of the 
N’ fingerprint are loaded into memory and correlated with the m random projections 
of the query fingerprint. This way, as opposed to loading m x N bits into memory, 
mı x N +m x N' bits are loaded. 

One way to do this is to pick mı random indices such as the first mı of them. 
However, this method is not robust enough to achieve any improvement over the 
linear search presented above. An alternative method is to pick indices with the 
largest magnitude similar to fingerprint digests (Goljan et al. 2010). 


6.3.6 Preprocessing, Quantization, Coding 


Another method that is proposed to compress PRNU fingerprints is through prepro- 
cessing, quantization, and finally entropy coding (Bondi et al. 2018). The prepro- 
cessing step consists of decimation and random projection. In this step, the number 
of fingerprint elements is reduced. The quantization step aims to reduce the num- 
ber of bits per fingerprint element. Finally, entropy coding is another step to further 
decrease the number of elements, which is similar to Huffman coding. 


Preprocessing Step 


The first step in this method is decimation. Given a fingerprint K with r rows, and 
c columns, this step decimates the fingerprint over d times. The columns are grouped 
by d elements to create [7]. Then the same operation is done along the rows. Thus, 
the resolution of the output fingerprint becomes [>] x [4]. 

For decimation, the authors propose to use a 1 x 5 bicubical interpolation as 


1.5|x|3 — 2.5|x|? + 1, if |x| <1 
Ka = 4 —0.5|x[? + 2.5|x/? — 4|x| +2, ifl < |x| <2 (6.13) 


0, otherwise 
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Kz is flattened to create a column vector. Then, Random Projection (RP) is applied 
to K4 to produce r; which contains P elements. 


Dead-Zone Quantization 


Although binary quantization is shown to be an effective method to reduce bitrate, 
the authors present a slightly different approach to divide to quantize rž. Instead of 
checking the sign of a fingerprint element, the authors divide the value range into 
3 ranges, and depending on where an element lies, its value is assigned as one of 
{0, 1, —1} as follows: 


+1, ifr * (i) > oó 
r G)= 10, if —cd <rx(i) <06 (6.14) 
—1, ifr * (i) < —o6 


where ø is the standard deviation of r% and ó is a factor for determining sparsity of 
the vector. Then +1’s are assigned to 10 and —1’s to 11 to create a bit-stream, r$ 
from rọ. 


Entropy Coding 


The last step of this pipeline is to apply an arithmetic entropy coding to the bit- 
stream rŠ. Here ó is a tunable variable that by increasing it, more sparse vector can 
be created. 


6.4 Decreasing the Number of Comparisons 


The methodologies presented in the previous section aim to speed up pairwise com- 
parisons. Although they can significantly decrease the cost of a fingerprint search, 
this improvement is not sufficient given the sheer number of cameras available. Often, 
a large database can contain many millions of cameras, and alternative methods may 
be required to carry over search in large datasets. This section presents an alterna- 
tive methodology that improves searching in large databases by reducing pairwise 
comparisons. 


6.4.1 Clustering by Cameras 


An obvious way to reduce the search space is by clustering camera fingerprints by 
their brands or models. A query fingerprint can be compared with only the camera 
fingerprints from the same brand/model. 

As presented in Chapter 4, a fingerprint dataset, D = (K i, K2, ... Ky} can be 
divided into smaller groups as D = (Di, D2, ... Dm} where D; is a set that consists 
of a single camera model. 
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Fig.6.1 a Composite Fingerprint-based Search Tree (CFST): query is matched against the tree of 
composite fingerprints with fingerprints at the leaves 


This way, a query fingerprint from the same camera model as D; can be searched 
within that group and a speedup of A can be achieved where |d| is the length of 
the list d. 

One of the main strengths of this approach is that it can be combined with any 
other search methodology presented in this chapter. 


6.4.2 Composite Fingerprints 


Bayram et al. (2012) proposed a group testing approach to organize large databases 
and utilize it to decrease the number of comparisons when searching for the source 
of a camera fingerprint (or an image). In group testing methods, multiple objects are 
combined to create a composite object. When a composite object is compared with 
a query object, a negative decision indicates that none of the objects in the group 
matches with the query object. On the other hand, a positive decision indicates one 
or more objects in the composite may match with the query object. 


Building a Composite Tree 


The proposed method is proposed to generate composite fingerprints, which creates 
unordered binary trees. A leaf node on these trees is a single fingerprint, whereas the 
internal nodes are the composition of their descendant nodes. 

Suppose that there are N fingerprints, (Fi, F>... Fy} which will be used to gen- 
erate a composite tree as in Fig. 6.1. The composite tree in the figure is formed of 4 
fingerprints (i.e., (Fi, F2, F3, F4}), and the composite fingerprints are defined as 


1 N 
Cin = — > Fj (6.15) 
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The reason for normalizing the composite fingerprint by Ir is to ensure the 
composite has unit variance. 
Notice that one of the main drawbacks of this approach is that it doubles the 
storage usage (i.e., for N fingerprints, a composite tree contains 2 * N — 1). 


Source Identification 


Suppose that a database D consists of N fingerprints of length n, i.e., D = 
{F\, Fy... Fy}. Now consider a query fingerprint, F} whose source is under ques- 
tioning. To determine whether F} is captured by any camera in the composite tree 
formed of (F, Fy... Fy}, F4 and C}.y are correlated. 

Given a preset threshold, 7, the correlation of F} and Cı:y can result in the 
following: 


e Hj: corr(Fy, Ci:.N) < T: none of the fingerprints in Cı: match with F; 
e Hy: corr(Fy, Ci:.N) > T, one or more fingerprints in Ci. may match with Fy. 


The decision of the preset threshold, r depends on the number of fingerprints in 
a composite (i.e., N in this case). 

If the decision of the correlation with Cj. is positive, Fç is correlated with its 
children (i.e., Cj:v/2 and Cy/2+1:). This process is recursively done until a match 
is found or all pairwise comparisons are found to be negative. 


6.5 Hybrid Methods 


In this section, other methodologies combining multiple of the previous methods. 
These methods improve ... 


6.5.1 Search over Composite-Digest Search Tree 


In this work, Taspinar et al. (2017) leverages the individual strengths of the search 
over fingerprint digest (Sect. 6.3.1) and composite fingerprint (Sect. 6.4) to make the 
search on large databases more efficient. 

In this method, two-level composite-digest search trees are created as in Fig. 6.2. 
Each root node is acomposite fingerprint which is the mean of the original fingerprints 
in the tree, and each leaf node is a digested version of individual fingerprints. 

Similar to Sect. 6.4, as the number of fingerprints increases in a composite finger- 
print, the share of each fingerprint will decrease, which will result in a less reliable 
decision. Therefore, creating a composite from so many fingerprints will not be a 
viable option. Because this will lead to more incorrect decisions (i.e., lower true pos- 
itive and higher false positive rates). Therefore, a more viable approach is to divide 
the large dataset into smaller subgroups. 
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Fig. 6.2 Composite-Digest Search Tree (CDST): one-level CFST with fingerprint digests at the 
leaves instead of fingerprints 


Fig. 6.3 Full Digest Search Tree (FDST): one-level CFST comprised of composite digests and 
fingerprint digests 


The search stage for this approach is performed in two steps. In the first stage, a 
query fingerprint is compared with composite fingerprints. If the decision is negative, 
it indicates that the entire subset doesn’t contain the source camera of the query image. 
Otherwise, a linear search is done over the fingerprint digests (leaf nodes) as opposed 
to the linear full-length versions of the fingerprints. 


6.5.2 Search over Full Digest Search Tree 


To further improve the efficiency of the previous method, Taspinar et al. present the 
search over a full digest search tree where the tree structure and search method are 
the same except that composite fingerprint are also digested as in Fig. 6.3 (Taspinar 
et al. 2017). 

In this method, composite fingerprints contain more fingerprint elements than 
individual fingerprint digests. The reason for this is a composite fingerprint obtained 
from n fingerprints is contaminated by n — 1 other fingerprints when it is correlated 
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by a fingerprint obtained from one of its child cameras. Hence, typically a higher 
number of fingerprint elements are used for composite fingerprint digests. 


6.6 Conclusion 


This chapter presented source camera identification on large databases, which is a 
crucial task when a crime image or video such as child pornography is found, and 
there is no suspect camera available. To tackle this problem, visual objects can be 
collected from social media, and a fingerprint estimate can be extracted from each 
camera in the set. 

Assuming that there is no geometric transformation on visual objects, or the 
transformations are reverted through a registration step, this chapter focused on 
determining if one of the cameras in the dataset has captured the query visual objects. 

The methods are grouped under three main categories: reducing the complexity 
of pairwise fingerprint correlation, decreasing the number of correlations, and hybrid 
methods that use two methods to achieve both of the ways of speeding up. 

Various techniques have been presented in this chapter that can help speed up the 
search. 
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Chapter 7 A) 
Source Camera Model Identification Gregi for 


Sara Mandelli, Nicolò Bonettini, and Paolo Bestagini 


Every camera model acquires images in a slightly different way. This may be due 
to differences in lenses and sensors. Alternatively, it may be due to the way each 
vendor applies characteristic image processing operations, from white balancing to 
compression. For this reason, images captured with the same camera model present 
a common series of artifacts that enable to distinguish them from other images. In 
this chapter, we focus on source camera model identification through pixel analysis. 
Solving the source camera model identification problem consists in identifying which 
camera model has been used to shoot the picture under analysis. More specifically, 
assuming that the picture has been digitally acquired with a camera, the goal is to 
identify the brand and model of the device used at acquisition time without the need 
to rely on metadata. Being able to attribute an image to the generating camera model 
may help forensic investigators to pinpoint the original creator of images distributed 
online, as well as to solve copyright infringement cases. For this reason, the forensics 
community has developed a wide series of methodologies to solve this problem. 


7.1 Introduction 


Given a digital image under analysis, we may ask ourselves a wide series of different 
questions about its origin. We may be interested in knowing whether the picture has 
been downloaded from a certain website. We may want to know which technology is 
used to digitalize the image (e.g., if it comes from a camera equipped with a rolling 
shutter or a scanner relying on a linear sensor). We may be curious about the model of 
camera used to shot the picture. Alternatively, we may need very specific details about 
the precise device instance that was used at image inception time. Despite all of these 
questions are related to source image attribution, they are very different in nature. 
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Therefore, answering all of them is far from being an easy task. For this reason, the 
multimedia forensics community typically tackles one of these problems at a time. 

In this chapter, we are interested in detecting the camera model that is used to 
acquire an image under analysis. Identifying the camera model which is used to 
acquire a photograph is possible thanks to the many peculiar traces left on the image 
at shooting time. To better understand which are the traces we are referring to, in 
this section we provide the reader with some background on the standard image 
acquisition pipeline. Finally, we provide the formal definition of the camera model 
identification problem considered in this chapter. 


7.1.1 Image Acquisition Pipeline 


We are used to shoot photographs everyday with our smartphones and cameras in a 
glimpse of an eye. Fractions of seconds pass from the moment we trigger the shutter 
to the moment we visualize the shot that we took. However, in this tiny amount of 
time, the camera performs a huge amount of operations. 

The digital image acquisition pipeline is not unique, and may differ depending on 
the vendor, the device model and the available on-board technologies. However, it 
is reasonable to assume that a typical digital image acquisition pipeline is composed 
of a series of common steps (Ramanath et al. 2005), as shown in Fig. 7.1. 

Light rays pass through a lens that focus them on the sensor. The sensor is typically 
a Charge-Coupled Device (CCD) or Complementary Metal-Oxide Semiconductor 
(CMOS), and can be imagined as a matrix of small elements geometrically organized 
on a plane. Each element represents a pixel, and returns a different voltage depending 
on the intensity of the light that hits it. Therefore, the higher the amount of captured 
light, the higher the output voltage, and the brighter the pixel value. 

As these sensors react to light intensity, different strategies to capture color infor- 
mation may be applied. If multiple CCD or CMOS sensors are available, prisms can 
be used to split the light into different color components (typically red, green, and 
blue) that are directed to the different sensors. In this way, each sensor captures the 
intensity of a given color component, thus a color image can be readily obtained com- 
bining the output of each sensor. However, multiple sensors are typically available 
only on high-end devices, making this pipeline quite uncommon. 

A more customary way of capturing color images consists in making use of a 
Color Filter Array (CFA) (or Bayer filter). This is a thin array of color filters placed 
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Fig. 7.1 Typical steps of a common image acquisition pipeline 
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on top of the sensor. Due to these filters, each sensor's element is hit only by light 
in a narrow wavelength band corresponding to a specific color (typically red, green 
or blue). This means that the sensor returns the intensity of green light for certain 
pixels, the intensity of blue light for other pixels, and the intensity of red light for 
the remaining pixels. Which pixels capture which color depends on the shape of 
the CFA, which is a vendor choice. At this point, the output of the sensor consists 
of three partially sampled color layers in which only one color value is recorded 
at each pixel location. Missing color information, like the blue and red compo- 
nents for pixels that only acquired green light, are retrieved via interpolation from 
neighboring cells with the available color components. This procedure is known as 
debayering or demosaicing, and can be implemented using proprietary interpolation 
techniques. 

After this raw version of a color image is obtained, a list of additional operations 
are in order. For instance, as lenses may introduce some kinds of optical distortion, 
most notably barrel distortion, pincushion distortion or combinations of them, it is 
common to apply some digital correction that may introduce forensic traces. Addi- 
tionally, white balancing and color correction are other operations that are often 
applied and may be vendor-specific. Finally, lossy image compression is typically 
applied by means of JPEG standard, which again may vary with respect to compres- 
sion quality and vendor-specific implementation choices. 

Since a few years ago, these processing steps were the main sources of camera 
model artifacts. However, with the rapid proliferation of computational photography 
techniques, modern devices implement additional custom functionalities. This is the 
case of bokeh images (also known as portrait images) synthetically obtained through 
processing. These are pictures in which the background is digitally blurred with 
respect to the foreground object to obtain an artistic effect. Moreover, many devices 
implement the possibility of shooting High Dynamic Range (HDR) images, which 
are obtained by combining multiple exposures into a single one. Additionally, several 
smartphones are equipped with multiple cameras and produce pictures by mixing 
their outputs. Finally, many vendors introduce the possibility of shooting photographs 
with special filter effects that may enhance the picture in different artistic ways. All 
of these operations are custom and add traces to the pool of artifacts that can be 
exploited as a powerful asset for forensic analysis. 


7.1.2 Problem Formulation 


The problem of source camera model identification consists in detecting the model 
of the device used to capture an image. Although the definition of this problem 
seems pretty straightforward, depending on the working hypothesis and constraints, 
the problem may be cast in different ways. For instance, the analyst may only have 
access to a finite set of possible camera models. Alternatively, the forensic investiga- 
tor may want to avoid using metadata. In the following, we provide the two camera 
model identification problem formulations that we consider in the rest of the chapter. 
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Fig.7.2 Representation of closed-set (a) and open-set (b) camera model identification problem 


In both formulations, we only focus on pixel-based analysis, i.e., we do not consider 
the possibility of relying on metadata information. 


Closed-set Identification 

Closed-set camera model identification refers to the problem of detecting which is 
the camera model used to shoot a picture within a set of known devices, as shown in 
Fig. 7.2a. In this scenario, the investigator assumes that the image under analysis has 
been taken with a device within a family of devices she/he is aware of. If the image 
does not come from any of those devices, the investigator will wrongly attribute the 
image to one of those known devices, no matter what. 

Formally, let us define the set of labels of known camera models as 7%. Moreover, 
let us define I as a color image acquired by the device characterized with label t € 7%. 
The goal of closed-set camera model identification is to provide an estimate £ of t 
given the image I and the set 7%. Notice that £ € 7; by construction. Therefore, if the 
hypothesis that t € Z; does not hold in practice, this approach is not applicable, as 
the condition £ € J; would imply that £ Z t. 


Open-set Identification 

In many scenarios, it is not realistic to assume that the analyst has full control over 
the complete set of devices that may have been used to acquire the digital image 
under analysis. In this case, it is better to resort to the open-set problem formulation. 
The goal of open-set camera model identification is twofold, as shown in Fig. 7.2b. 
Given an image under analysis, the analyst aims 


e To detect if the image comes from the set of known camera models or not. 
e To detect the specific camera model, if the image comes from a known model. 


This is basically a generalization of the closed-set formulation that accommodates 
for the analysis of images coming from unknown devices. 

Formally, let us define the set of labels of known models as Z,, and the set of labels 
of unknown models (i.e., all the other existing camera models) as 7,,. Moreover, let us 
define I as a color image acquired by the device characterized with label t € Z, UJ Tu. 
The goal of open-set camera model identification is 
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e To estimate whether t € J or t € Ty. 
e To provide an estimate f of t incase t € 74. In this case also £ € Tp. 


Despite this problem formulation looks more realistic than its closed-set counterpart, 
the vast majority of camera model literature only copes with the closed-set problem. 
This is mainly due to the difficulty in well modeling the unknown set of models. 


7.2 Model-Based Approaches 


As mentioned in Sect. 7.1.1, multiple operations performed during image acquisition 
may be characteristic of a specific camera brand and model. This section provides an 
overview of camera model identification methods that work by specifically leveraging 
those traces. These methods are known as model-based methods, as they assume that 
each artifact can be modeled, and this model can be exploited to reverse engineer the 
used device. Methods that model each step of the acquisition chain are historically 
the first ones being developed in the literature (Swaminathan et al. 2009). 


7.2.1 Color Filter Array (CFA) 


As CCD/CMOS sensor elements of digital cameras are sensitive to the received light 
intensity and not to specific colors, the CFA is usually introduced in order to split the 
incoming light into three corresponding color components. Then, a three-channel 
color image can be obtained by interpolating the pixel information associated with 
the color components filtered by the CFA. There are several works in the literature 
that exploit specific characteristics associated with the CFA configuration, i.e., the 
specific arrangement of color filters in the sensor plane and the CFA interpolation 
algorithm to retrieve information about the source camera model. 


CFA Configuration 

Even without investigating the artifacts introduced by demosaicing, we can exploit 
information from the Bayer configuration to infer some model-specific features. For 
example, Takamatsu et al. (2010) developed a method to automatically identify CFA 
patterns from the distribution of image noise variances. In Kirchner (2010), the Bayer 
configuration was estimated by restoring the raw image (i.e., prior to demosaicing) 
from the output interpolated image. Authors of Cho et al. (2011) showed how to 
estimate the CFA pattern from a single image by extracting pixel statistics on 2 x 2 
image blocks. 


CFA Interpolation 
In an image acquisition pipeline, the demosaicing step inevitably injects some inter- 
pixel correlations into the interpolated image. Existing demosaicing algorithms differ 
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in the size of their support region, in the way they select the pixel neighborhood, and 
in their assumptions about the image content and adaptability to this Gunturk et al. 
(2005), Menon and Calvagno (2011). Over the years, the forensics community has 
largely exploited these interpolation traces to discriminate among different camera 
models. 

The first solution dates back to 2005 (Popescu and Farid 2005). The authors esti- 
mated the inter-pixel correlation weights from a small neighborhood around a given 
pixel, treating each color channel independently. They also observed that different 
interpolation algorithms present different correlation weights. Similarly, authors of 
Bayram et al. (2005) exploited the correlation weights estimated as shown in Popescu 
and Farid (2005) and combined them with frequency domain features. In Bayram 
et al. (2006), the authors improved upon their previous results by treating smooth 
and textured image regions differently. Their choice was motivated by the different 
treatment of distinct demosaicing implementations on high-contrast regions. 

In 2007, Swaminathan et al. (2007) estimated the demosaicing weights only on 
interpolated pixels, i.e., without accounting for pixels which are relatively invariant to 
the CFA interpolation process. Authors found the CFA pattern by fitting linear filter- 
ing models and selecting the one which minimized the interpolation error. As done in 
Bayram et al. (2006), they considered three diverse estimations according to the local 
texture of the images. Interestingly, the authors’ results pointed out some similarity 
in interpolation patterns among camera models from the same manufacturer. 

For the first time, Cao and Kot (2009) did not explore each color channel indepen- 
dently, but instead exposed cross-channel pixel correlations caused by demosaicing. 
Authors reported that many state-of-the-art algorithms often employ color difference 
or hue domains for demosaicing, and this inevitably injects a strong correlation across 
image channels. In this vein, authors extended the work by Swaminathan et al. (2007) 
by estimating the CFA interpolation algorithm using a partial second-order derivative 
model to detect both intra-channel and cross-channel demosaicing correlations. They 
estimated these correlations by grouping pixels into 16 categories based on their CFA 
positions. In 2010, the same authors tackled the camera model identification problem 
on mobile phone devices, showing how CFA interpolation artifacts enable to achieve 
excellent classification results on dissimilar models, while confusing cameras of the 
same or very similar models (Cao and Kot 2010). 

In line with Cao and Kot (2009), also Ho et al. (2010) exploited cross-channel 
interpolation traces. The authors measured the inter-channel differences of red and 
blue colors with respect to the green channel, and converted these into the frequency 
domain to estimate pixel correlations. They were motivated by the fact that many 
demosaicing algorithms interpolate color differences instead of color channels (Gun- 
turk et al. 2005). Similarly, Gao et al. (2011) worked with variances of cross-channel 
differences. 

Differently from previous solutions, in 2015, Chen and Stamm (2015) proposed a 
new framework to deal with camera model identification, inspired to the rich models 
proposed by Fridrich and Kodovsky (2012) for steganalysis. Authors pointed out 
that previous solutions were limited by their essential use of linear or local linear 
parametric models, while modern demosaicing algorithms are both non-linear and 
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adaptive (Chen and Stamm 2015). The authors proposed to build a rich model of 
demosaicing by grouping together a set of non-parametric submodels, each capturing 
specific partial information on the interpolation algorithm. The resulting rich model 
could return a much more comprehensive representation of demosaicing compared 
to previous strategies. 

Authors of Zhao and Stamm (2016) focused on controlling the computation cost 
associated with the solution proposed in Swaminathan et al. (2007), which relied on 
least squares estimation and could become impractical when a consistent number 
of pixels was employed. The authors proposed an algorithm to find the pixel set 
that yields the best estimate of the parametric model, still keeping the computational 
complexity feasible. Results showed that the proposed method could achieve higher 
camera model classification accuracy than Swaminathan et al. (2007), at a fixed 
computational cost. 


7.2.2 Lens Effects 


Every digital camera includes a sophisticated optical system to project the acquired 
light intensity on small CCD or CMOS sensors. Projection from the lens on to the 
sensor inevitably injects some distortions in the acquired image, usually known as 
optical aberrations. Among them, the most common optical aberrations are radial 
lens distortion, chromatic aberration, and vignetting. The interesting point from a 
forensics perspective is that different camera models use different optical systems, 
thus they reasonably introduce different distortions during image acquisition. For this 
reason, the forensics literature has widely exploited optical aberration as a model- 
based trace. 


Radial Lens Distortion 

Radial lens distortion is due to the fact that lenses in consumer cameras usually 
cannot magnify all the acquired regions with a constant magnification factor. Thus, 
different focal lengths and magnifications appear in different areas. This lens imper- 
fection causes radial lens distortion which is a non-linear optical aberration that 
renders straight lines in the real image as curved lines on the sensor. Barrel distortion 
and pincushion distortion are the two main distortion forms we usually find in digital 
images. In San Choi et al. (2006), the authors measured the level of distortion of an 
image and used the distortion parameters as a feature to discriminate among different 
source camera models. Additionally, they investigated the impact of optical zoom on 
the reliability of the method, noticing that a classification beyond mid-range focal 
lengths can be problematic, due to the vanishing of distortion artifacts. 


Chromatic Aberration 

Chromatic aberration stems from wavelength-dependent variations of the refractive 
index of a lens. This phenomenon causes a spread of the color components over the 
sensor plane. The consequence of chromatic aberration is that color fringes appear 
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in high-contrast regions. We can identify axial chromatic aberration and lateral chro- 
matic aberration. The former accounts for the variations of the focal point along the 
optical axis; the latter indicates the relative displacement of different light compo- 
nents along the sensor plane. 

In 2007, Lanh et al. (2007) applied the model proposed in Johnson and Farid (2006) 
to estimate the lateral chromatic aberration using small patches extracted from the 
image center. Then, they fed the estimated parameters to a Support Vector Machine 
(SVM) classifier for identifying the source camera model. Later, Gloe et al. (2010) 
proposed to reduce the computational cost of the previous solution by locally estimat- 
ing the chromatic aberration. The authors also pointed out some issues due to non- 
linear phenomena occurring in modern lenses. Other studies were carried on in Yu 
et al. (2011), where authors pointed out a previously overlooked interaction between 
chromatic aberration and focal distance of lenses. Authors were able to obtain a 
stable chromatic aberration pattern distinguishing different copies of the same lens. 


Vignetting 

Vignetting is the phenomenon of light intensity fall-off around the corners of an 
image with respect to the image center. Usually, wide-aperture lenses are more prone 
to vignetting, because fewer light rays reach the sensor’s edges. 

The authors of Lyu (2010) estimated the vignetting pattern from images adopt- 
ing a generalization of the vignetting model proposed in Kang and Weiss (2000). 
They exploited statistical properties of natural images in the derivative domain to 
perform a maximum likelihood estimation. The proposed method was tested among 
synthetically generated and real vignetting, showing far better results for lens model 
identification on synthetic data. The lower accuracy on real scenarios can be due to 
difficulties in correctly estimating the vignetting pattern whenever highly textured 
images are considered. 


7.2.3 Other Processing and Defects 


The traces left by the CFA configuration, the demosaicing algorithm, and the optical 
aberrations due to lens defects are only a few of the footprints that forensic ana- 
lysts can exploit to infer camera model-related features. We report here some other 
model-based processing operations and defects that carry information about the cam- 
era model. 


Sensor Dust 

In 2008, Dirik et al. (2008) exploited the traces left by dust particles on the sensor 
of digital single-lens reflex cameras to perform source camera identification. The 
authors estimated the dust pattern (i.e., its location and shape) from the camera or 
from a number of images shot by the camera. The proposed methodology was robust 
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to both JPEG compression and downsizing operations. However, the authors pointed 
out some Issues in correctly estimating the dust pattern for complex and non-smooth 
images, especially for low focal length values. 


Noise Model 
Modeling the noise pattern of acquired images can represent a valuable feature to 
distinguish among cameras of the same or different models. 

Here, Photo Response Non Uniformity (PRNU) is arguably one of the most influ- 
ential contributions to multimedia forensics. PRNU is a multiplicative noise pattern 
that occurs in any sensor-recorded digital image, due to imperfections in the sensor 
manufacturing process. In Lukas et al. (2006), Chen et al. (2008) the authors pro- 
vided a complete modeling of the digital image at the sensor output as a function 
of the incoming light intensity and noise components. They discovered the PRNU 
noise to represent a powerful and robust device-related fingerprint, able to uniquely 
identify different devices of the same model. Since this chapter deals with model- 
level identification granularity, we do not further analyze the potential of PRNU for 
device identification. 

Contrarily to PRNU-based methods, which model only the multiplicative noise 
term due to sensor imperfections, other methods focused on modeling the entire 
noise corrupting the digital image and exploited this noise to tackle the camera model 
identification task. For instance, Thai et al. (2014) worked with raw images at the 
sensor output. The authors modeled the complete noise contribution of natural raw 
images (known as the heteroscedastic noise Foi et al. 2009) with only two parameters, 
and used it as a fingerprint to discriminate camera models. Contrarily to PRNU noise, 
heteroscedastic noise could not separate different devices of the same camera model, 
thus it is more appropriate for model-level identification granularity. 

The heteroscedastic noise consists of a Poisson-distributed component which 
accounts for the photon shot noise and dark current, and a Gaussian-distributed term 
which addresses other stationary noise sources like read-out noise (Foi et al. 2009). 
The proposed approach in Thai et al. (2014) involved the estimation of the two 
characteristic noise parameters per camera model considering 50 images shot by 
the same model. However, the method presented some limitations: first, it requires 
raw image format without post-processing or compression operations; second, non- 
linear processes like gamma correction modify the heteroscedastic noise model; 
finally, changes in ISO sensitivity and demosaicing operations worsen the detector 
performances. 

In 2016, the authors improved upon their previous solution, solving the cam- 
era model identification from TIFF and JPEG compressed images and taking into 
account the non-linear effect of gamma correction (Thai et al. 2016). They proposed a 
generalized noise model starting from the previously exploited heteroscedastic noise 
but including also the effects of subsequent processing and compression steps. This 
way, each camera model could be described by three parameters. The authors tested 
their methodology over 18 camera models. 
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7.3 Data-Driven Approaches 


Contrarily to model-based methods presented in Sect. 7.2, in the last few years we 
have seen the widespread adoption of data-driven approaches to camera model iden- 
tification. All solutions that extract knowledge and insights from data without explic- 
itly modeling the data behavior using statistical models fall into this category. These 
data-driven approaches have greatly outperformed multiple model-based solutions 
proposed in the past. While model-based solutions usually focus on a specific com- 
ponent of the image acquisition pipeline, data-driven approaches can capture model 
traces left by the interplay among multiple components. For instance, image noise 
characteristics do not only originate from sensor properties and imperfections, but 
are the result of CFA interpolation and other internal processing operations (Kirchner 
and Gloe 2015). 
We can further divide data-driven approaches into two broad categories: 


1. Methods based on hand-crafted features, which derive properties of data by 
extracting suitable data descriptors like repeated image patterns, texture, and gra- 
dient orientations. 

2. Methods based on learned features, which directly exploit raw data, i.e., without 
extracting any descriptor, to learn distinguishable data properties and to solve the 
specific task. 


7.3.1 Hand-Crafted Features 


Over the years, the multimedia forensics community has often imported image 
descriptors from other research domains to solve the camera model identification 
problem. For instance, descriptors developed for steganalysis and image classifica- 
tion have been widely used to infer forensic traces on images. The reason behind 
their application in the forensics field is that, in general, these descriptors enable an 
effective representation of images and contain relevant information to distinguish 
among different image sources. 

Surprisingly, the very first proposed solution to solve the camera model identifica- 
tion problem in a blind setup exploited hand-crafted features (Kharrazi et al. 2004). 
The authors extracted descriptors about color channel distributions, wavelet statistics 
and image quality metrics, then they fed these to an SVM classifier to distinguish 
among few camera models. 

In the following, we illustrate the existing solutions based on hand-crafted feature 
extraction. 


Local Binary Patterns 

Local binary patterns (Ojala et al. 2002) represent a good example of a general 
image descriptor that can capture various image characteristics. In a nutshell, local 
binary patterns can be computed as the difference between a central pixel and its 


7 Source Camera Model Identification 143 


local neighborhood, binary quantized and coded to produce an histogram carrying 
information about the local inter-pixel relations. 

In 2012, the authors of Xu and Shi (2012) were inspired by the idea that the entire 
image acquisition pipeline could generate localized artifacts in the final image, and 
these characteristics could be captured by the uniform grayscale invariant local binary 
patterns (Ojala et al. 2002). The authors extracted features for red and green color 
channel in the spatial and wavelet domains, resulting in a 354-dimension feature per 
image. As commonly done in the majority of works previous to the wide adoption 
of neural networks, the authors fed these features to an SVM classifier to classify 
image sources among 18 camera models. 


DCT Domain Features 

Since the vast majority of cameras automatically stores the acquired images in JPEG 
format, many forensics works approach camera model identification by exploiting 
some JPEG-related features. In particular, many researches focused on extracting 
model-related features in the Discrete Cosine Transform (DCT) domain. 

In 2009, Xu et al. (2009) extracted the absolute value of the quantized 8 x 8 
DCT coefficient blocks, then computed the difference between the element blocks 
along four directions to look for inter-coefficient correlations. They fed these to 
Markov transition probability matrices to identify statistical differences inside images 
of distinct camera models. The elements of transition probability matrices were 
thresholded and fed as hand-crafted features to an SVM classifier. 

DCT domain features have been exploited also in Wahab et al. (2012), where 
the authors extracted some conditional probability features from the absolute values 
of three 8 x 8 DCT coefficient blocks. From these blocks, they only used the 4 x 
4 upper left frequencies which demonstrated to be most significant for the model 
identification task. The authors fed an SVM classifier to distinguish among 4 camera 
models. 

More recently, the authors of Bonettini et al. (2018) have shown that JPEG eigen- 
algorithm features can be used for camera model identification as well. These features 
capture the differences among DCT coefficients obtained after multiple JPEG com- 
pression steps. A standard random forest classifier is used for the task. 


Color Features 

Camera model identification can be faced also by looking at some model-specific 
image color features. Such artifacts can be found according to the specific properties 
of the used CFA and color correction algorithms. 

As previously reported, in 2004, Kharrazi et al. (2004) assumed that certain traits 
and patterns should exist between the RGB bands of an image, regardless of the 
scene content depicted. Therefore, they extracted 21 image features which aimed at 
capturing cross-channel correlations and energy ratios together with intra-channel 
average value, pixel neighbor distributions and Wavelet statistics. In 2009, this set 
of features was enlarged with 6 additional color features by Gloe et al. (2009), 
which included the dependencies between the average values of the color channels. 
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Specifically, the authors extended the feature set by computing the norms of the dif- 
ference between the original and the white-point corrected image. 


Image Quality Metrics Features 

In 2002, Avcibas et al. (2002) proposed to extract some image quality metrics to infer 
statistical properties from images. This feature set included measures based on pixel 
differences, correlations, edges, spectral content, context, and human visual system. 
The authors proposed to exploit these features for steganalysis applications. 

In 2004, Kharrazi et al. (2004) employed a subset of these metrics to deal with the 
camera model identification task. The authors motivated their choice by noticing that 
different cameras produce images of different quality, in terms of sharpness, light 
intensity and color quality. They computed 13 image quality metrics, dividing them 
into three main categories: those based on pixel differences, those based on correla- 
tions and those based on spectral distances. While Kharrazi et al. (2004) averaged 
these metrics over the three color channels, Celiktutan et al. (2008) evaluated each 
color band separately resulting in 40 features. 


Wavelet Domain Features 

Wavelet-domain features are known to be effective to analyze the information content 
of images and noise components, enabling to capture interesting insights on their 
statistical properties for what concerns different resolutions, orientations, and spatial 
positions (Mallat 1989). 

These features have been exploited as well to solve the camera model identi- 
fication task. For instance, Kharrazi et al. (2004) exploited 9 Wavelet-based color 
features. Later on, Celiktutan et al. (2008) added 72 more features by evaluating the 
four moments of the Wavelet coefficients over multiple sub-bands, and by predicting 
sub-band coefficients from their neighborhood. 


Binary Similarity Features 

Like image quality metrics, also binary similarity features were first proposed to 
tackle steganalysis problems and then adopted in forensics applications. For instance, 
Avcibas et al. (2005) investigated statistical features extracted from different bit 
planes of digital images, measuring their correlations and their binary texture char- 
acteristics. 

Binary similarities were exploited in 2008 by Celiktutan et al. (2008), which con- 
sidered the characteristics of the neighborhood bit patterns among the less significant 
bit planes. The authors investigated histograms of local binary patterns by account- 
ing for occurrences within a single bit plane, across different bit planes and across 
different color channels. They resulted in a feature vector of 480 elements. 


Co-occurrence Based Local Features 

Together with binary similarity features and image quality metrics, also rich models, 
or co-occurrence-based local features (Fridrich and Kodovsky 2012), were imported 
from steganalysis to forensics. Three main steps compose the extraction of co- 
occurrences of local features (Fridrich and Kodovsky 2012): (i) the computation 


7 Source Camera Model Identification 145 


of image residuals via high-pass filtering operations; (ii) the quantization and trun- 
cation of residuals; (iii) the computation of the histogram of co-occurrences. 

In this vein, Chen and Stamm (2015) built a rich model of a camera demosaicing 
algorithm inspired by Fridrich and Kodovsky (2012) to perform camera model iden- 
tification. The authors generated a set of diverse submodels so that every submodel 
conveyed different aspects of the demosaicing information. Every submodel was built 
using a particular baseline demosaicing algorithm and computing co-occurrences 
over a particular geometric structure. By merging all submodels together, the authors 
obtained a feature vector of 1,372 elements to feed an ensemble classifier. 

In 2017, Marra et al. (2017) extracted a feature vector of 625 co-occurrences 
from each color channel and concatenated the three color bands, resulting in 1,875 
elements. The authors investigated a wide range of scenarios that might occur in 
real-world forensic applications, considering both the closed-set and the open-set 
camera model identification problems. 


Methods with Several Features 

In many state-of-the-art works, several different kinds of features were extracted 
from images instead of just one. The motivation lied in the fact that merging multiple 
features could better help finding model-related characteristics, thus improving the 
source identification performance. 

For example, (as already reported) Kharrazi et al. (2004) combined color-related 
features with Wavelet statistics and image quality metrics, ending with 34-element 
features. Similarly, Gloe (2012) used the same kinds of features but extended the 
set up to 82 elements. On the other hand, Celiktutan et al. (2008) exploited image 
quality metrics, Wavelet statistics, and binary similarity measures, resulting in a total 
of 592 features. 

Later on, in 2018, Sameer and Naskar (2018) investigated the dependency of 
source classification performance with respect to illumination conditions. To infer 
model-specific traces, they extracted image quality metrics and Wavelet features 
from images. 


Hand-crafted Features in Open-set Problems 

The open-set identification problem started to be investigated in 2012 by Gloe (2012), 
which proposed two preliminary solutions based on the extraction of 82-element 
hand-crafted features from images. The first approach was based on a one-class 
SVM, trained for each available camera model. The second proposal trained a binary 
SVM in one-versus-one fashion, for all pairs of known camera models. However, 
the achieved results were unconvincing and revealed practical difficulties to han- 
dle unknown models. Nonetheless, this represented a very first attempt to handle 
unknown source models and highlighted the need of further investigations on the 
topic. 

Later in 2017, Marra et al. (2017) explored the open-set scenario considering 
two case studies: (i) the limited knowledge situation, in which only images from one 
camera model were available; (ii) the zero knowledge situation, in which authors had 
no clue on the model used to collect images. In the former case, the authors aimed at 
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detecting whether images came from the known model or from an unknown one; in 
the latter case, the goal was to retrieve a similarity among images shot by the same 
model. 

To solve the first issue, the authors followed a similar approach to Gloe (2012) by 
training a one-class SVM on the co-occurrence based local features extracted from 
known images. They solve the second task by looking for K nearest neighbor features 
in the testing image dataset using a kd-tree-based search algorithm. The performed 
experiments reported acceptable results but definitively underlined the urgency of 
new strategies to handle realistic open-set scenarios. 


7.3.2 Learned Features 


We have recently assisted in the rapid rise of solutions based on learned forensic 
features which have quickly replaced hand-crafted forensic feature extraction. For 
instance, considering the camera model identification task, we can directly feed 
digital images to a deep learning paradigm in order to learn model-related features to 
associate Images with their original source. Actually, we can differentiate between 
two kinds of learning frameworks: 


1. Two-step sequential learning frameworks, which first learn significant features 
from data and successively learn how to classify data according to the extracted 
features. 

2. End-to-end learning frameworks, which directly classify data by learning some 
features in the process. 


Among deep learning frameworks, Convolutional Neural Networks (CNNs) are 
now the widespread solution to face several multimedia forensics tasks. In the fol- 
lowing, we report state-of-the-art solutions based on learned features which tackle 
the camera model identification task. 


Two-step Sequential Learning Frameworks 

One of the first contributions to camera model identification with learned features 
was proposed in 2017 by Bondi et al. (2017a). The authors proposed a data-driven 
approach based on CNNs to learn model-specific data features. The proposed method- 
ology was based on two steps: first, training a CNN with image color patches of 
64 x 64 pixels to learn a feature vector of 128 elements per patch; second, training 
an SVM classifier to distinguish among 18 camera models from the Dresden Image 
Database (Gloe and Böhme 2010). In their experimental setup, the authors outper- 
formed state-of-the-art methods based on hand-crafted features (Marra et al. 2017; 
Chen and Stamm 2015). 

In the same year, Bayar and Stamm (2017) proposed a CNN-based approach 
robust to resampling and recompression artifacts. In their paper, the authors proposed 
an end-to-end learning approach (completely based on CNNs) as well as a two- 
step learning approach. In the latter scenario, they extracted a set of deep features 
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from the penultimate CNN layer and fed an Extremely Randomized Trees (ET) 
classifier, purposely trained for camera model identification. The method was tested 
on 256 x 256 image patches (retaining only the green color channel) from 26 camera 
models of the Dresden Image Database (Gloe and Böhme 2010). To mimic realistic 
scenarios, JPEG compression and resampling were applied to the images. The two- 
step approach returned slightly better performances than the end-to-end approach. 

The same authors also started to investigate open-set camera model identification 
using deep learning (Bayar and Stamm 2018). They compared different two-step 
approaches based on diverse classifiers, showing that the ET classifier outperformed 
the other options. 

Meanwhile, Ferreira et al. (2018) proposed a two-step sequential learning approach 
in which the feature extraction phase was composed of two parallel CNNs, an 
Inception-ResNet (Szegedy et al. 2017) and an Xception (Chollet 2017) architecture. 
First, the authors selected 32 patches of 229 x 229 pixels per image according to an 
interest criterion. Then, they separately trained the two CNNs returning 256-element 
features each. They merged these features and passed them through a secondary shal- 
low CNN for classification. The proposed method was tested over the IEEE Signal 
Processing Cup 2018 dataset (Stamm et al. 2018; IEEE Signal Processing Cup 2018 
Database 2021). The authors considered data augmentations as well, including JPEG 
compression, resizing, and gamma correction. 

In 2019, Rafi et al. (2019) proposed another two-step pipeline. The authors first 
trained a 201-layer DenseNet CNN (Huang et al. 2017) with image patches of 
256 x 256 pixels. They proposed the DenseNet architecture since dense connec- 
tions could help in detecting minute statistical features related to the source camera 
model. Moreover, the authors augmented the training set by applying JPEG com- 
pression, gamma correction, random rotation, and empirical mode decomposition 
(Huang et al. 1998). The trained CNN was used to extract features from image 
patches of different sizes; these features were concatenated and employed to train a 
second network for classification. The proposed method was trained over the IEEE 
Signal Processing Cup 2018 dataset (Stamm et al. 2018; IEEE Signal Processing 
Cup 2018 Database 2021) but tested on the Dresden Image Database as well (Gloe 
and Böhme 2010). 

Following the preliminary analysis performed by Bayar and Stamm (2018), Júnior 
et al. (2019) proposed an in-depth study on open-set camera model identification. The 
authors pursued a two-steps learning approach, comparing several feature extraction 
algorithms and classifiers. Together with hand-crafted feature extraction methodolo- 
gies, the CNN-based approach proposed in Bondi et al. (2017a) was investigated. 
Not much surprisingly, this last approach revealed to be the best effective for feature 
extraction. 

An interesting two-step learning framework was proposed by Guera et al. (2018). 
The authors explored a CNN-based solution to estimate the reliability of a given 
image patch, i.e., its likelihood to be used for model identification. Indeed, saturated 
or dark regions might not contain sufficient information on the source model, thus 
could be discarded to improve the attribution accuracy. The authors borrowed the 
CNN architecture from Bondi et al. (2017a) as feature extractor and built a multi-layer 
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network for reliability estimation. They compared an end-to-end learning approach 
with two different sequential learning strategies, showing that two-step approaches 
largely outperformed the end-to-end solution. The authors suggested the lower per- 
formance of the end-to-end approach could be due to the limited amount of training 
data, probably not enough to learn all the CNN parameters in end-to-end fashion. 
The proposed method was tested on the same Dresden’s models employed in Bondi 
et al. (2017a). By discarding unreliable patches from the process, an improvement 
on 8% was measured on the identification accuracy. 


End-to-end Learning Frameworks 

The very first approach to camera model identification with learned features was 
proposed in 2016 by Tuama et al. (2016). Differently from the above presented works, 
the authors developed a complete end-to-end learning framework, where feature 
extraction and classification were performed in a unified framework exploiting a 
single CNN. The proposed method was quite innovative considering that the literature 
was based on feature extraction and classification steps. In their work, the authors 
proposed to extract 256 x 256 color patches from images, to compute a residual 
through high-pass filtering, and then to feed a shallow CNN architecture which 
returned a classification score for each camera model. The method was tested on 
26 models from the Dresden Image Database (Gloe and Böhme 2010) and 6 further 
camera models from a personal collection. 

In 2018, Yao et al. (2018) proposed a similar approach to Bondi et al. (2017a), but 
the authors trained a CNN using an end-to-end framework. 64 x 64 color patches 
were extracted from images, fed to a CNN and then passed through majority voting 
to classify the source camera model. The authors evaluated their methodology on 25 
models from the Dresden Image Database (Gloe and Böhme 2010), investigating also 
the effects of JPEG compression, noise addition, and image re-scaling on a reduced 
set of 5 devices. 

Another end-to-end approach was proposed in 2020 by Rafi et al. (2021), which 
put particular emphasis on the use of CNNs as pre-processing block prior to a classi- 
fication network. The pre-processing CNN was composed of a series of “RemNant” 
blocks, i.e., 3-layer convolutional blocks followed by batch normalization. The out- 
put to the pre-processing step contained residual information about the input image 
and preserved model-related traces from a wide range of spatial frequencies. The 
whole network, named “RemNet’, included the pre-processing CNN and a shal- 
low CNN for classification. The authors tested their methodology on 64 x 64 image 
patches, considering 18 models from the Dresden Image Database (Gloe and Böhme 
2010) and the IEEE Signal Processing Cup 2018 dataset (Stamm et al. 2018; IEEE 
Signal Processing Cup 2018 Database 2021). 

A very straightforward end-to-end learning approach was deployed in Mandelli 
et al. (2020), where the authors did not aim at exploring novel strategies to boost the 
achieved identification accuracy, but investigated how JPEG compression artifacts 
can affect the results. The authors compared four state-of-the-art CNN architectures 
in different training and testing scenarios related to JPEG compression. Experiments 
were computed on 512 x 512 image patches from 28 camera models of the Vision 
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Image Dataset (Shullani et al. 2017). Results showed that being careful to the JPEG 
grid alignment and to the compression quality factor of training/testing images is of 
paramount importance, independently on the chosen network architecture. 


Learned Features in Open-set Problems 

The majority of the methods presented in previous sections tackled closed-set cam- 
era model identification, leaving open-set challenges to future investigations. Indeed, 
the open-set literature still counts very few works and lacks proper algorithm com- 
parisons with respect to closed-set investigations. Among learned feature method- 
ologies, only Bayar and Stamm (2018) and Júnior et al. (2019) tackled open-set 
classification. Here, we illustrate the proposed methodologies in greater more detail. 

In 2018, Bayar and Stamm (2018) proposed two different learning frameworks 
for open-set model identification. The common paradigm behind the approaches 
was two-step sequential learning, where the feature extraction block included all 
the layers of a CNN that precede the classification layer. Following their prior work 
(Bayar and Stamm 2017), the authors trained a CNN for camera model identification 
and selected the resulting deep features associated with the second-to-last layer to 
train a classifier. The authors investigated 4 classifiers: (i) one fully connected layer 
followed by the softmax function, which is the standard choice in end-to-end learning 
frameworks; (ii) an ET classifier; (iii) an SVM-based classifier; (iv) a classifier based 
on the cosine similarity distance. 

In the first approach, the authors trained the classifier in closed-set fashion over the 
set of known camera models. In the testing phase, they thresholded the maximum 
score returned by the classifier. If that score exceeded a predefined threshold, the 
image was associated with a known camera model, otherwise it was declared to be 
of unknown provenance. 

In the second approach, known camera models were divided in two disjoint sets: 
“known-known” models and “known-unknown” models. A closed-set classifier was 
trained over the reduced set of “known-known” models, while a binary classifier was 
trained to distinguish “known-known” models from “known-unknown” models. In 
the testing phase, if the binary classifier returned the “known-unknown” class, the 
image was linked to an unknown source model, otherwise, the image was associated 
with a known model. 

In any case, when the image was determined to belong to a known model, the 
authors estimated the actual source model as in a standard closed-set framework. 

Experimental results were evaluated on 256 x 256 grayscale image patches 
selected from 10 models of the Dresden Image Database (Gloe and Böhme 2010) 
and other 15 personal devices. The authors showed that the softmax-based classifier 
performed worst among the 4 proposals. The ET classifier achieved the best perfor- 
mance for both approaches both in identifying the source model in closed-set fashion 
and in known-vs-unknown model detection. 

An in-depth study on the open-set problem has been pursued by Júnior et al. 
(2019). The authors thoroughly investigated this issue by comparing 5 feature extrac- 
tion methodologies, 3 training protocols and 12 open-set classifiers. Among feature 
extractors, they selected hand-crafted based frameworks (Chen and Stamm 2015; 
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Marra et al. 2017) together with learned features (Bondi et al. 2017a). As for training 
protocols, they considered the two strategies presented by Bayar and Stamm (2018) 
and one additional strategy which included more “known-unknown” camera mod- 
els. Several open-set classifiers were explored, ranging from different SVM-based 
approaches to one classifier proposed in the past by the same authors and to those 
previously investigated by Bayar and Stamm (2018). 

The training dataset included 18 camera models from the Dresden Image Database 
(Gloe and Böhme 2010), i.e., the same models used by Bondi et al. (2017a) in closed- 
set scenario, as known models, and the remaining Dresden’s models as “known- 
unknown” to be used for the additional proposed training protocol. Furthermore, 
35 models from the Image Source Attribution UniCamp Dataset (Image Source 
Attribution UniCamp Dataset 2021) and 250 models from the Flickr image hosting 
service were used to simulate unknown models. 

As performed by Bondi et al. (2017a), the authors extracted 32 non-overlapping 
64 x 64 patches from images. Results demonstrated the highest effectiveness of 
CNN-based learned features over hand-crafted features. Many open-set classifiers 
returned high performances and, as reported in Bayar and Stamm (2018), ET clas- 
sifier was one of the best performing. Interestingly, the authors noticed that the 
third additional training protocol that required more “known-unknown” models was 
outperformed by a simpler solution, corresponding to the second training approach 
proposed by Bayar and Stamm (2018). 


Learned Forensics Similarity 

An interesting research work which parallels camera model identification was pro- 
posed in 2018 by Mayer and Stamm (2018). In their paper, the authors proposed a 
two-step learning framework that did not aim at identifying the source camera model 
of an image, but aimed at determining whether two images (or image patches) were 
shot by the same camera model. 

The main novelty did not lie in the feature extraction step, which was CNN- 
based as many other contemporaneous works, but in the second step developing a 
learned forensics similarity measure between the two compared patches. This was 
implemented by training a multi-layer neural network that mapped the features from 
patches into a single similarity score. If the score overcame a predefined threshold, 
the two patches were said to come from the same model. 

The authors did not limit the comparison over known camera models, but also 
demonstrated the effectiveness of the method on unknown models. In particular, 
they trained the two networks over disjoint camera model sets, exploiting grayscale 
image patches with size 256 x 256, selected from 20 models of the Dresden Image 
Database (Gloe and Böhme 2010) and 45 further models. Results showed that authors 
could accurately detect a model similarity even if both the two images came from 
unknown sources. 
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7.4 Datasets and Benchmarks 


This section provides a template with the fundamental characteristics an image 
dataset should have in order to explore camera model identification. Then, it shows 
an overview of available datasets in the literature. It also explains how to properly 
deal with these datasets for a fair evaluation. 


7.4.1 Template Dataset 


The task of camera model identification assumes that we are not interested in retriev- 
ing information on the specific device used for capturing a photograph neither on the 
scene content depicted. Given these premises, we can define some “good” features 
a template dataset should include (Kirchner and Gloe 2015): 


e Images of similar scenes taken with different camera models. 
e Multiple devices per camera model. 


Collecting several images with similar scenes shot by different models avoids any 
possible bias on the results due to the depicted scene content. Moreover, capturing 
images from multiple devices of the same camera model accounts for realistic sce- 
narios. Indeed, we should expect to investigate query images shot by devices never 
seen at algorithm development phase, even though their camera models were known. 

A common danger that needs to be avoided is to confuse model identification 
with device identification. Hence, as we want to solve the source identification task 
at model-based granularity, we must not confuse device-related traces with model- 
related ones. In other words, query images coming from an unknown device of a 
known model should not be confused as coming from an unknown source. In this 
vein, a dataset should consist of multiple devices from the same model. This enables 
to evaluate a feature set or a classifier for its ability to detect models independently 
from individual devices. Including more devices per model in the image dataset 
allows to check for this requirement and helps keeping the problem at bay. 


7.4.2 State-of-the-art Datasets 


The following datasets are frequently used or have been specifically designed for 
camera model identification: 


e Dresden Image Database (Gloe and Bohme 2010). 

Vision Image Dataset (Shullani et al. 2017). 

Forchheim Image Database (Hadwiger and Riess 2020). 

Dataset for Camera Identification on HDR images (Shaya et al. 2018). 
Raise Image Dataset (Nguyen et al. 2015). 
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Table 7.1 Main datasets’ characteristics 


Dataset No. of models No. of images Image formats 

Dresden (Gloe and Böhme 2010) | 26 18,456 JPEG, 
Uncompressed-Raw 

Vision (Shullani et al. 2017) 28 34,427 JPEG 

Forchheim (Hadwiger and Riess | 25 23,106 JPEG 

2020) 

HDR (Shaya et al. 2018) 21 5,415 JPEG 

Raise (Nguyen et al. 2015) 3 8,156 Uncompressed-Raw 

Socrates (Galdi et al. 2019) 60 9,700 JPEG 

IEEE Cup (Stamm et al. 2018; 10 2,750 JPEG 

IEEE Signal Processing Cup 2018 

Database 2021) 


e Socrates Dataset (Galdi et al. 2019); 

e the dataset for the IEEE Signal Processing Cup 2018: Forensic Camera Model 
Identification Challenge (Stamm et al. 2018; IEEE Signal Processing Cup 2018 
Database 2021). 


To highlight the differences among the datasets in terms of camera models and 
images involved, Table 7.1 summarizes the main datasets’ characteristics. In the fol- 
lowing, we illustrate in detail the main features of each dataset. 


Dresden Image Database 

The Dresden Image Database (Gloe and Bóhme 2010) has been designed with the 
specific purpose of investigating the camera model identification problem. This is 
a publicly available dataset, including approximately 17,000 full-resolution natu- 
ral images stored in the JPEG format with the highest available JPEG quality and 
1,500 uncompressed raw images. The image acquisition process has been carefully 
designed in order to provide image forensics analysts a dataset which satisfies the two 
properties reported in Sect. 7.4.1. Images were captured under controlled conditions 
from 26 different camera models considering up to 5 different instances per model. 
For each motif, at least 3 scenes were captured by varying the focal length. 

Thanks to the careful design process and to the considerable amount of images 
included, the Dresden Image Database has become over the years one of the most used 
image datasets for tackling image forensics investigations. The use of this dataset as 
a benchmark has favored the spreading of research works on the topic, easing the 
comparison between different methodologies and their reproducibility. Focusing on 
the research on the camera model identification task, the Dresden Image Database 
has been used several times by state-of-the-art works (Bondi et al. 2017b; Marra 
et al. 2017; Tuama et al. 2016). 


Vision Image Dataset 
The Vision dataset (Shullani et al. 2017) is a recent image and video dataset, pur- 
posely designed for multimedia forensics investigations. Specifically, Vision dataset 
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has been designed to follow the trend on image and video acquisition and social 
sharing. In the last few years, photo-amateurs have rapidly transitioned to hand-held 
devices as preferred mean to capture images and videos. Then, the acquired content 
is typically shared on social media platforms like WhatsApp or Facebook. In this 
vein, Vision dataset collects almost 12,000 native images captured by 35 modern 
smartphones/tablets, including also their related social media version. 

The Vision dataset well satisfies the firstrequirement presented in Sect. 7.4.1 about 
capturing images of similar scenes taken with different camera models. Moreover, 
it represents a substantial improvement with respect to Dresden Image Database, 
since it collects images from modern devices and social media platforms. However, 
a minor limitation regards the second requirement provided in Sect.7.4.1: there are 
only few camera models with two or more instances. Indeed, over the 35 available 
modern devices, we have 28 different camera models. 

In the literature, Vision dataset has been used many times for investigations on 
the camera model identification problem. Among them, we can cite (Cozzolino and 
Verdoliva 2020; Mandelli et al. 2020). 


Forchheim Image Database 

The Forchheim Image Database (Hadwiger and Riess 2020) consists of more than 
23,000 images of 143 scenes shot by 27 smartphone cameras of 25 models and 9 
brands. It has been proposed to cleanly separate scene content and forensic traces, 
and to support realistic post-processing like social media recompression. Indeed, six 
different qualities are provided per image, collecting different copies of the same 
original image passed through social networks. 


Dataset for Camera Identification on HDR Images 

The proposed dataset collects standard dynamic range and HDR images captured in 
different conditions, including various capturing motions, scenes, and devices (Shaya 
et al. 2018). It has been proposed to investigate the source identification problem on 
HDR images, which usually introduce some difficulties due to their complexity and 
wider dynamic range. 23 mobile devices were used for capturing 5,415 images in 
different scenarios. 


Raise Image Dataset 

The Raise Image Dataset (Nguyen et al. 2015) concerns 8,156 high-resolution raw 
images, depicting various subjects and scenarios. 3 different camera of diverse mod- 
els were employed. 


Socrates Dataset 

The Socrates Dataset (Galdi et al. 2019) has been built to investigate the source cam- 
era identification problem on images and videos coming from smartphones. Images 
and videos have been collected directly by smartphone owners, ending up with about 
9,700 images and 1000 videos captured with 104 different smartphones of 15 differ- 
ent makes and about 60 different models. 
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Dataset for the IEEE Signal Processing Cup 2018: Forensic Camera Model 
Identification Challenge 

The IEEE Signal Processing Cup (SP Cup) is a student competition in which under- 
graduate students form teams to work on real-life challenges. In 2018, the camera 
model identification goal was selected as the topic for the SP Cup (Stamm et al. 
2018). Participants were provided with a dataset consisting of JPEG images from 10 
different camera models (including point-and-shoot cameras, cell phone cameras, 
and digital single-lens reflex cameras), with 200 images captured using each camera 
model. In addition, also post-processed operations (e.g., JPEG recompression, crop- 
ping, contrast enhancement) were applied to the images. The complete dataset can 
be downloaded from IEEE DataPort (IEEE Signal Processing Cup 2018 Database 
2021). 


7.4.3 Benchmark Protocol 


The benchmark protocol commonly followed by forensics researchers is to divide 
the available images into three disjoint image datasets: the training dataset Z,, the 
validation dataset Z, and the evaluation dataset Ze. This split ensures that images 
seen during the training process (i.e., images belonging to either Z, or Z,) are never 
used in testing phase, thus they do not introduce any bias in the results. 

Moreover, it is reasonable to assume that query images under analysis might not 
be acquired with the same devices seen in training phase. Therefore, what is usually 
done is to pick a selection of devices to be used for the training process, namely the 
training device set D;. Then, the proposed method can be evaluated over the total 
amount of available devices. A realistic and challenging scenario is to include only 
one device per known camera model in the training set D,. 


7.5 Case Studies 


This section is devoted to numerical analysis of some selected methods in order to 
showcase the capabilities of modern camera model identification algorithms. Specif- 
ically, we consider a set of baseline CNNs and co-occurrences of image residuals 
(Fridrich and Kodovsky 2012) analyzing both closed-set and open-set scenarios. The 
impact of JPEG compression is also investigated. 


7.5.1 Experimental Setup 


To perform our experiments, we select natural JPEG compressed images from the 
Dresden Image Database (Gloe and Böhme 2010) and the Vision Image Dataset 


7 Source Camera Model Identification 155 


(Shullani et al. 2017). Regarding the Dresden Image Database, we consider images 
from “Nikon_D70” and “Nikon_D70s” camera models as coming from the same 
model, as the differences between these two models are negligible due to a minor 
version update (Gloe and Bóhme 2010; Kirchner and Gloe 2015). We pick the same 
number of images per device, and use as reference the device with the lowest image 
cardinality. We end up with almost 15,000 images from 54 different camera models, 
including 108 diverse devices. 

We work in a patch-wise fashion, extracting N patches with size P x P pixels 
from each image. We investigate four different networks. Two networks are selected 
from the recently proposed EfficientNet family of models (Tan and Le 2019), which 
achieves very good results both in computer vision and multimedia forensics tasks. 
Specifically, we select EfficientNetBO and EfficientNetB4 models. The other net- 
works are known in literature as ResNet50 (He et al. 2016) and XceptionNet (Chollet 
2017). Following a common procedure in CNN training, we initialize the network 
weights using those trained on ImageNet database (Deng et al. 2009). All CNNs are 
trained using cross-entropy loss and Adam optimizer with default parameters. The 
learning rate is initialized to 0.001 and is decreased by a factor 10 whenever the 
validation loss does not improve for 10 epochs. The minimum accepted learning rate 
is set to 1078. We train the networks for at most 500 epochs, and training is stopped 
if loss does not decrease for more than 50 epochs. The model providing the best 
validation loss is selected. 

At test time, classification scores are always fed to the softmax function. In the 
closed-set scenario, we assign the query image to the camera model associated with 
the highest softmax score. We use the average accuracy of correct predictions as 
evaluation metrics. In the open-set scenario, we evaluate results as a function of 
the detection accuracy of “known” versus “unknown” models. Furthermore, we also 
provide the average accuracy of correct predictions over the set of known camera 
models. In other words, given that a query image was taken with one camera model 
belonging to the known class, we evaluate the classification accuracy as in a closed- 
set problem reduced to the “known” categories. 

Concerning the dataset split policy, we always keep 80% of the images for training 
phase, further divided in 85%/15% for training set Z, and validation set Z,, respec- 
tively. The remaining 20% of the images are used in the evaluation set Ze. All tests 
have been run on a workstation equipped with one Intel® Xeon Gold 6246 (48 Cores 
@3.30GHz), RAM 252 GB, one TITAN RTX (4608 CUDA Cores @1350MHz), 
24 GB, running Ubuntu 18.04.2. We resort to Albumentation (Buslaev et al. 2020) 
as data augmentation library for applying JPEG compression to images, and we use 
Pytorch (Paszke et al. 2019) as Deep Learning framework. 


7.5.2 Comparison of Closed-Set Methods 


In this section, we compare closed-set camera model identification methodologies. 
We do not consider model-based approaches, since data-driven methodologies have 
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extensively outperformed them in the last few years (Bondi et al. 2017a; Marra 
et al. 2017). In particular, we compare 4 different CNN architectures with a well- 
known state-of-the-art method based on hand-crafted feature extraction. Specifically, 
we extracted the co-occurrences of image residuals as suggested in Fridrich and 
Kodovsky (2012). Indeed, it has been shown that exploiting these local features 
provides valuable insights on the camera model identification task (Marra et al. 
2017). 

Since data-driven methods need to be trained on a specific set of images (or image- 
patches), we consider different scenarios in which the characteristics of the training 
dataset change. 


Training with Variable Patch-sizes, Proportional Number of Patches per Image 
In this setup, we work with images from both the Dresden Image Database and the 
Vision Image Dataset. We randomly extract N patches per image with size P x P 
pixels. We consider patch-sizes P € (256, 512, 1024}. As first experiment, we would 
like to maintain a constant number of image pixels seen in training phase. In doing 
so, we can compare the methods’ performance under the same number of input 
pixels. Hence, the smaller the patch-size, the more image patches are provided. In 
case of P = 256, we randomly extract N = 40 patches per image; for P = 512, we 
randomly extract N = 10 patches per image; when P = 1024, we randomly extract 
N = 3 patches per image. The number of input image pixels remains constant in the 
first two situations, while the last scenario includes few pixels more. 

Co-occurrence Based Local Features. We extract co-occurrences features of 625 
elements (Fridrich and Kodovsky 2012) from each analyzed patch independent of 
the input patch-size P. More precisely, we extract features based on the third order 
filter named “s3-spam14hv”, as suggested in Marra et al. (2017). We apply this filter 
to the luminance component of the input patches. 

Then, to associate the co-occurrences features with one camera model, we train 
a 54-classes classifier composed of a shallow neural network. The classifier is 
defined as 


e a fully connected layer with 625 input channels, i.e., the dimension of co- 
occurrences, and 256 output channels; 

e a dropout layer with 0.5 as dropout ratio; 

e a fully connected layer with 256 input channels and 54 output channels. 


We train this elementary network using Adam optimization with initial learning rate 
of 0.1, following the same paradigm described in Sect. 7.5.1. Results on evaluation 
images are shown in Table 7.2. 


Table 7.2 Accuracy of camera model identification in closed-set scenario using co-occurrences 
features 


Patch-size P 256 512 1024 
Accuracy (%) 68.77 78.81 82.77 
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Fig. 7.3 Confusion matrix achieved in closed-set classification by co-occurrences extraction, P = 
1024 


Notice that the larger the patch-size, the higher the achieved accuracy. Even though 
the number of image pixels seen by the classifier in training phase is essentially 
constant, the features extracted from the patches become more significant as the 
pixel region fed to the co-occurrence extraction grows. Figure 7.3 shows the achieved 
confusion matrix for P = 1024. 

CNNs. CNN results are shown in Table7.3. Differently from co-occurrences, 
CNNSs seem to be considerably less dependent on the specific patch-size, as long as the 
number of image pixels fed to the network remains the same. All the network models 
perform slightly worse on larger patches (i.e., P = 1024). This lower performance 
may originate from a higher difficulty of the networks in converging during the 
training process. Larger patches inevitably require reduced batch-size during training. 
Less samples in the batch means less results’ average in one training epoch and 
reduced CNN capabilities in converging to the best parameters. 
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Table7.3 Accuracy of closed-set camera model identification using 4 different CNN architectures 
as a function of the patch-size. The number of extracted patches per image varies such that the 
overall image pixels seen by CNNS is almost constant 


Patch-size P 256 512 1024 
EfficientNetB0 (%) 94.68 94.60 94.28 
EfficientNetB4 (%) 94.29 94.57 93.20 
ResNet50 (%) 93.71 92.48 91.70 
XceptionNet (%) 93.74 94.21 91.23 
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Fig. 7.4 Confusion matrix achieved in closed-set classification by EfficientNetB0, P = 512, 
N = 10 


Overall, results on CNNSs strongly outperform co-occurrences. Fixing a patch- 
size, all the CNNs return very similar results. The EfficientNet family of models 
slightly outperforms the other CNN architectures, as it has been shown several times 
in the literature (Tan and Le 2019). 

To provide an example of the results, Fig. 7.4 shows the achieved confusion matrix 
by EfficientNetBO architecture for P = 512, N = 10. 
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Training with Fixed Patch-sizes, Variable Number of Patches per Image 
In this setup, we work again with images from both the Dresden Image Database and 
the Vision Image Dataset. Differently from the previous setup, we do not evaluate 
results of co-occurrences since they are significantly outperformed by the CNN- 
based framework. Instead of exploring diverse patch-sizes, we now fix the patch-size 
P = 256. We aim at investigating how CNN performance changes by varying the 
number of extracted patches per image. N can vary among {1, 5, 10, 25, 40}. 
Figure 7.5 depicts the closed-set classification accuracy results. For small values 
of N (i.e., N < 10), all the 4 networks enhance their performance as the number of 
extracted patches increases. When N further increases, CNNs achieve a performance 
plateau and accuracy does not change too much across N e {10, 25, 40}. 


Training with Variable Patch-size, Fixed Number of Patches per Image 
We now fix the number of extracted patches per image to N = 10. Then, we vary 
the patch-size P among {64, 128, 256, 512}. The goal is to explore whether results 
change as a function of the patch-size. 

Figure 7.6 depicts the results. All the CNNs achieve accuracies beyond 90% when 
patch-size is larger or equal to 256. When the data fed to the network are too small, 
CNNS are not able to estimate well the classification parameters. 


Investigations on the Influence of JPEG Grid Alignment 

When editing a photograph or uploading a picture over a social media platform, it may 
be cropped with respect to its original size. After the cropping, a JPEG compression 
is usually applied to the image. These operations may de-synchronize the 8 x 8 
characteristic pixel grid of the original JPEG compression. Usually, a new 8 x 8 grid 
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non-aligned with the original one is generated. When training a CNN to solve an 
image classification problem, the presence of JPEG grid misalignment on images 
can strongly impact the performance. In this section, we summarize the experiments 
performed in Mandelli et al. (2020) to investigate on the influence of JPEG grid 
alignment on the closed-set camera model identification task. 

During training and evaluation steps, we investigate two scenarios: 


1. working with JPEG compressed images whose JPEG grid is aligned to the 8 x 8 
pixel grid starting from upper-left corner; 

2. working with JPEG compressed images whose JPEG grid starts in random pixel 
position. 


To simulate these scenarios, we first compress all the images with the maximum 
JPEG Quality Factor (QF), i.e., QF = 100. This process generates images with JPEG 
lattice and does not impair the image visual quality. The first scenario includes images 
cropped such that the extracted image patches are always aligned to the 8 x 8 pixel 
grid. In the latter scenario, we extract patches in random positions. Differently from 
previous sections, here we are considering a reduced image dataset, the same used in 
Mandelli et al. (2020) from which we are reporting the results. Images are selected 
only from the Vision dataset and N = 10 squared patches of 512 x 512 pixels are 
extracted from the images. 

Figure 7.7 shows the closed-set classification accuracy for all CNNs as a func- 
tion of the considered scenarios. In particular, Fig. 7.7a depicts results in case we 
train and test on randomly cropped patches; Fig. 7.7b shows what happens when 
training on JPEG-aligned images and testing on randomly cropped images; Fig. 7.7c 
explores training on randomly cropped images and testing on JPEG-aligned images; 
Fig. 7.7d draws results in case we train and test on JPEG-aligned images. It is worth 
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Fig. 7.7 Closed-set classification accuracy achieved by CNNS as a function of JPEG grid align- 
ment in training and/or evaluation phases. The bars represent, respectively, from left to right: M 
EfficientNetB0; | EfficientNetB4; ” ResNet50; ° XceptionNet. In a, we train and test on ran- 
domly cropped patches; in b, we train on JPEG-aligned and test on randomly cropped patches; 
in c, we train on randomly cropped and test on JPEG-aligned patches; in d, we train and test on 
JPEG-aligned patches 


noticing that being careful to the JPEG grid alignment is paramount for achieving 
good accuracy. When training a detector on JPEG-aligned patches and testing on 
JPEG-misaligned ones, results drop consistently as shown in Fig. 7.7b. 


Investigations on the Influence of JPEG Quality Factor 
In this section, we aim at investigating how the QF of JPEG compression affects 
CNN performance in closed-set camera model identification. As done in the previous 
section, we illustrate the experimental setup provided in Mandelli et al. (2020), 
where the effect of JPEG compression quality is studied for images belonging to 
the Vision dataset. We JPEG-compress the images with diverse QFs, namely QF € 
{50, 60, 70, 80, 90, 99}. Then, we extract N = 10 squared patches of 512 x 512 
pixels from images. Following previous considerations, we randomly extract the 
patches both for the training and evaluation datasets. 

To explore the influence of JPEG QF on CNN results, we train the network in two 
ways: 


1. we use only images of the original Vision Image dataset; 

2. we perform some training data augmentation. Half of the training images are 
taken from the original Vision Image dataset; the remaining part is compressed 
with a JPEG QF picked from the reported list. 


Notice that the second scenario assumes some knowledge on the JPEG QF of the 

evaluation images, and can thus improve the achieved classification results. 
Closed-set classification accuracy achieved by the CNNS is reported in Fig. 7.8. 

In particular, we show results as a function of the QF of the evaluation images. 


162 S. Mandelli et al. 


Accuracy 
Accuracy 


Accuracy 
Accuracy 


0 0 
50 60 70 8 90 9 50 60 70 8 9 9 
Test QF Test QF 
(O) (a) 


Fig. 7.8 Accuracy as a function of test QF for a EfficientNetB0, b EfficientNetB4, c ResNet50, d 
XceptionNet. The curves are drawn in accordance to the following legend: @ train on QF = 50; 8 
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The gray curve represents the first training scenario, i.e., in absence of training data 
augmentation. Note that this setup always draws the worst results. Also training on 
data augmented with QF = 99 almost corresponds to absence of augmentation and 
achieves acceptable results only when the test QF matches the training one. On the 
contrary, training on data augmented with low JPEG QFs (i.e., QF € {50, 60}) returns 
acceptable results for all the possible test QFs. For instance, by training on data com- 
pressed with QF = 50, evaluation accuracy can improve from 0.2 to more than 0.85. 
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Z.5.3 Comparison of Open-Set Methods 


In the open-set scenario some camera models are unknown, so we cannot train CNNs 
on the complete device set as information can be extracted only from the known 
devices. Let us represent the original camera model set with 7 : known models are 
denoted as Z¿, unknown models are denoted as 7;. 

To perform open-set classification, two main training strategies can be pursued 
(Bayar and Stamm 2018; Júnior et al. 2019): 


1. Consider all the available set J; as “known” camera models. Train one closed-set 
classifier over Z¿ camera models. 

2. Divide the set of 7 camera models into two disjoint sets: the “known-known” set 
Tk., and the “known-unknown” set Z, . Train one binary classifier to identify Z, 
camera models from 7%, camera models. 


Results are evaluated accordingly to two metrics (Bayar and Stamm 2018): 


1. the accuracy of detection between known camera models and unknown camera 
models; 

2. the accuracy of closed-set classification among the set of known camera models. 
This metric is validated only for images belonging to camera models in set 7%. 


Our experimental setup considers only EfficientNetBO as network architecture, 
since EfficientNet family of models (both EfficientNetBO and EfficientNetB4) always 
reports higher or comparable accuracies with respect to other architectures. We 
choose EfficientNetBO which is lighter than EfficientNetB4 to reduce training and 
testing time. We exploit images from both the Dresden Image Database and the Vision 
Image Dataset. Among the pool of 54 camera models, we randomly extract 36 cam- 
era models for the “known” set 7, and leave the remaining 18 to the “unknown” 
set 7„. Following considerations, we randomly extract N = 10 squared patches per 
image with patch-size P = 512. 


Training One Closed-set Classifier over “Known” Camera Models 

We train one closed-set classifier over models belonging to Z, set by following the 
same training protocol previously seen for closed-set camera model identification. 
In testing phase, we proceed with two consequential steps: 


1. detect if the query image is taken by a “known” camera model or an “unknown” 
model; 

2. if the query image comes from a “known” camera model, identify the source 
model. 


Regarding the first step, we follow the approach suggested in Bayar and Stamm 
(2018). Given a query image, if the maximum score returned by the classifier exceeds 
a predefined threshold, we assign the image to the category of “known” camera 
models. Otherwise, the image is associated with an “unknown” model. Following 
this procedure, the closed-set classification accuracy across models in Z, is 95%. 
Figure 7.9 shows the confusion matrix achieved by closed-set classification. 
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Fig.7.9 Confusion matrix achieved in closed-set classification by training one classifier (in closed- 
set fashion) over the 36 camera models in 7%. The average closed-set accuracy is 95% 


The detection accuracy varies with the threshold used to classify “known” ver- 
sus “unknown” camera models. Figure 7.10 depicts the ROC curve achieved by the 
proposed method. Specifically, we see the behavior of True Positive Rate (TPR) as 
a function of the False Positive Rate (FPR). The maximum accuracy value is 86.19%. 


Divide the Set of Known Camera Models into “Known-known” and “Known- 
unknown” 

In this experiment, we divide the set of known camera models into two disjoint sets: 
the set of “known-known” models J, and the set of “known-unknown” models Z, . 
Then, we train a binary classifier which discriminates between the two classes. In 
testing phase, all the images assigned to the “known-unknown” class will be classified 
as unknown. Notice that by training one binary classifier we are able to recognize 
as “known” category only images taken from camera models in J, . Therefore, even 
if camera models belonging to 7%, are known, their images will be classified as 
unknown in testing phase. 
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To overcome this limitation, we explore the possibility of training a binary classi- 
fier in different setups. In practice, we divide the known camera models’ set 7% into 
C disjoint subsets containing images of |Z, |/ C models, where | - | is the cardinality 
of the set. We define every disjoint set as Z+, c € [1, C]. Then, for each disjoint set 
Tk., we train one binary classifier telling known models and unknown models apart. 
Set Z,. represents the “known-known” class, while the remaining sets are joined and 


form the “known-unknown” class. F 


igure 7.11 depicts the split policy applied to Z 


for C = 2 and C = 3. For instance, when C = 3, we end up with 3 different training 
setups: the first considers Z¿, as known class and models in {7% , Tk, } as unknowns; 


166 S. Mandelli et al. 


Fig. 7.12 Maximum 
detection accuracy achieved 0.88 | 
in the ROC curve by training 
C classifiers over camera 
models in J; 


0.864 


0.841 


Accuracy 


0.821 


the second considers Z¿, as known class and models in (Z¿,, Z,;} as unknowns; the 
third considers Z¿, as known class and models in {Z,, Z¿,] as unknowns. 

In testing phase, we assign a query image to the “known-known” class only if at 
least one classifier returned a score associated with the “known-known” class which 
is sufficiently confident. In practice, we threshold the maximum score associated 
to the “known-known” class returned by the C classifiers. Whenever the maximum 
score overcomes the threshold, we assign the query image to a known camera model; 
otherwise we associate it with an unknown model. 

Notice that the number of classifiers can vary from C = 2 to C = 36 (i.e., the 
total number of available known camera models). In our experiments, we consider C 
equal to all the possible divisors of 36, i.e., C € {2, 3,4, 6, 9, 12, 18, 36}. In order 
to work with balanced training datasets, we always fix the number of images for each 
class to be equal to the image cardinality of the known set Tp,- 

The maximum detection accuracy achieved as a function of the number of clas- 
sifiers used is shown in Fig. 7.12. The case C = 1 corresponds to training only one 
classifier in closed-set fashion. Notice that when the number of classifiers starts to 
increase (i.e., when C > 9), the proposed methodology can outperform the closed-set 
classifier. 

We also evaluate the accuracy achieved in closed classification for images belong- 
ing to known models in 7%. For each disjoint set Z+, we can train a |7%|/ C-class clas- 
sifier in closed-set fashion. If one of the C binary classifiers assigns the query image 
to the 7%. set, we can identify the source camera model by exploiting the related 
closed-set classifier. This procedure is not needed in case C = |%|, i.e., C = 36. 
In this case, we can exploit the maximum score returned by the C binary classi- 
fiers. If the maximum score is related to the “known-known” class, the estimated 
source camera model is associated with the binary classifier providing the highest 
score. If the maximum score is related to the “known-unknown’”’ class, the estimated 
source camera model is unknown. For example, Fig. 7.13 shows the confusion matrix 
achieved in closed-set classification by training 36 classifiers. The average closed-set 
accuracy is 99.56%. 

Exploiting C = 36 binary classifiers outperforms the previously presented closed- 
set classifier, both in known-vs-unknown detection and closed-set classification. This 


7 Source Camera Model Identification 


Agfa-DC-733s 
Agfa_DC-830i} 

Agfa Sensor505-x} 
Canon_baus70} 
Canon_PowerShot A640} 
Casio_EX-Z150} 
Kodak_M1063) 
Nikon_CoolPixS710} 
Nikon_D200} 
Nikon_D70)} 
Olympus_mju-1050SW | 
Pentax_OptioA40) 


Rollei_RCP-7. 
Samsung_L74wide} 
Samsung_NV15) 

Sony _DSC-H50) 
Sony_DSC-T77| 

Apple iPhone4s} 
Huawei_P9} 
AppleiPhonedc} 
Lenovo_P70A} 

Apple iPhone4} 
Samsung_GalaxyS3} 
Sony XperiaZ1Compact | 
Huawei_P9Lite 
Wiko_Ridge4G; 
Samsung_GalaxyTrendPlus| 
Xiaomi_RedmiNote3} 
OnePlus_A3000} 


True label 


Samsung_Gala 
OnePh 
Huawei_Ascend | 

Samsung_GalaxyTabA | 


Wiko_Ridge4G, 


Sony_DSC-HE 


Predicted label 


Samsung_Ga 


C 


Samsung-GalaxyTabA 


Samsung_G: 


Unknown} 


167 


1.0 


0.8 


0.6 


0.4 


0.2 


‘0.0 


Fig. 7.13 Confusion matrix achieved in closed-set classification by training 36 classifier, consid- 


ering the 36 camera models in 7. The average closed-set accuracy is 99.56% 


comes at the expense of training 36 binary classifiers instead of one C-class classifier. 
However, each binary classifier is trained over a reduced amount of images, i.e., twice 
the number of images of the “known-known” class. On the contrary, the C-class 
classifier is trained over the total amount of known images. Therefore, the required 
computation time by training 36 classifiers does not significantly increase and it is 


still acceptable. 


7.6 Conclusions and Outlook 


In this chapter, we have reported multiple examples of camera model identification 
algorithms developed in the literature. From this report and the showcased exper- 
imental results, it is possible to understand that the camera model identification 
problem can be well solved in some situations, but not always. 
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For instance, good solutions exist for solving the camera model identification task 
in aclosed-set scenario, when images have not been further post-processed or edited. 
Given a set of known camera models and a pool of training images, standard CNNs 
can solve the task in a reasonable amount of time with high accuracy. 

However, this setup is too optimistic. In practical situations, it is unlikely that 
images never undergo any processing step after acquisition. Moreover, it is not real- 
istic to assume that an analyst has access to all possible camera models needed for an 
investigation. In this situation, further research is needed. As an example, compres- 
sion artifacts and other processing operations like cropping or resizing can strongly 
impact on the achieved performance. For this reason, a deeper analysis on images 
shared through social media networks is worthy of investigations. 

Moreover, open-set scenarios are still challenging even considering original (i.e., 
not post-processed) images. The proposed methods in the literature are not able to 
approach closed-set accuracy yet. As the open-set scenario is more realistic than its 
closed-set counterpart, further work in this direction should be done. 

Furthermore, the newest smartphone technologies based on HDR images, multi- 
ple cameras, super-dense sensors and advanced internal processing (e.g., “beautify” 
filters, AIl-guided enhancements, etc.) can further complicate the source identifica- 
tion problem, due to an increased image complexity (Shaya et al. 2018; Iuliani et al. 
2021). For instance, it has been already shown that sensor-based methodologies to 
solve the camera attribution problem may suffer for the portrait mode employed in 
smartphones of some particular brands (Iuliani et al. 2021). When portrait modality 
is selected to capture a human subject in foreground and blur the remaining back- 
ground, sensor traces risk to be hindered (Iuliani et al. 2021). 

Finally, we have not considered scenarios involving attackers. On one hand, 
antiforensic techniques could be applied to images in order to hide camera model 
traces. Alternatively, working with data-driven approaches, it is important to consider 
possible adversarial attacks to the used classifiers. In the light of these considerations, 
forensic approaches for camera model identification in modern scenarios still have 
to be developed. Such future algorithmic innovations can be further fostered with 
novel datasets that include these most recent challenges. 
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Chapter 8 A) 
GAN Fingerprints in Face Image gee 
Synthesis 


João C. Neves, Ruben Tolosana, Ruben Vera-Rodriguez, Vasco Lopes, 
Hugo Proença, and Julian Fierrez 


The availability of large-scale facial databases, together with the remarkable pro- 
gresses of deep learning technologies, in particular Generative Adversarial Networks 
(GANSs), have led to the generation of extremely realistic fake facial content, raising 
obvious concerns about the potential for misuse. Such concerns have fostered the 
research on manipulation detection methods that, contrary to humans, have already 
achieved astonishing results in various scenarios. This chapter is focused on the anal- 
ysis of GAN fingerprints in face image synthesis. In particular, it covers an in-depth 
literature analysis of state-of-the-art detection approaches for the entire face synthe- 


'The present chapter is an adaptation from the following article: Neves et al. (2020). DOI: http:// 
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sis manipulation. It also describes a recent approach to spoof fake detectors based 
ona GAN-fingerprint Removal autoencoder (GANprintR). A thorough experimental 
framework is included in the chapter, highlighting (i) the potential of GANprintR 
to spoof fake detectors, and (ii) the poor generalisation capability of current fake 
detectors. 


8.1 Introduction 


Images! and videos containing fake facial information obtained by digital manipula- 
tion have recently become a great public concern (Cellan-Jones 2019). Up until the 
advent of DeepFakes a few years ago, the number and realism of digitally manip- 
ulated fake facial contents were very limited by the lack of sophisticated editing 
tools, the high domain of expertise required, and the complex and time-consuming 
process involved to generate realistic fakes. The scientific communities of biometrics 
and security in the past decade paid some attention in understanding and protecting 
against those limited threats around face biometrics (Hadid et al. 2015), with special 
attention to presentation attacks conducted physically against the face sensor (cam- 
era) using various kinds of face spoofs (e.g. 2D or 3D printed, displayed, mask-based, 
etc.) (Hernandez-Ortega et al. 2019; Galbally et al. 2014). 

However, nowadays it is becoming increasingly easy to automatically synthesise 
non-existent faces or even to manipulate the face of a real person in an image/video, 
thanks to the free access to large public databases and also to the advances on deep 
learning techniques that eliminate the requirements of manual editing. As a result, 
accessible open software and mobile applications such as ZAO and FaceApp have 
led to large amounts of synthetically generated fake content (ZAO 2019; FaceApp 
2017). 

The current methods to generate digital fake face content can be categorised into 
four different groups, regarding the level of manipulation (Tolosana et al. 2020c; 
Verdoliva 2020): (i) entire face synthesis, (ii) face identity swap, (iii) facial attribute 
manipulation and (iv) facial expression manipulation. 

In this chapter, we focus on the entire face synthesis manipulation, where 
a machine learning model, typically based on Generative Adversarial Networks 
(GANs) (Goodfellow et al. 2014), learns the distribution of the human face data, 
allowing to generate non-existent faces by sampling this distribution. This type of 
facial manipulation provides astonishing results and is able to generate extremely 
realistic fakes. Nevertheless, contrary to humans, most state-of-the-art detection sys- 
tems provide very good results against this type of facial manipulation, remarking 
how easy it is to detect the GAN “fingerprints” present in the synthetic images. 

This chapter covers the following aspects in the topic of GAN Fingerprints: 


e An in-depth literature analysis of the state-of-the-art detection approaches for 
the entire face synthesis manipulation, including the key aspects of the detection 
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Fig. 8.1 Architecture of the GAN-fingerprint removal approach. In general, the state-of-the- 
art face manipulation detectors can easily distinguish between real and synthetic fake images. This 
usually happens due to the existence and exploitation by those detectors of GAN “fingerprints” 
produced during the generation of synthetic images. The GANprintR approach proposed in (Neves 
et al. 2020) aims to remove the GAN fingerprints from the synthetic images and spoof the facial 
manipulation detection systems, while keeping the visual quality of the resulting images 


systems, the databases used for developing and evaluating these systems, and the 
main results achieved by them. 

e Anapproach to spoof state-of-the-art facial manipulation detection systems, while 
keeping the visual quality of the resulting images. Figure8.1 graphically sum- 
marises the approach presented in Neves et al. (2020) based on a GAN-fingerprint 
Removal autoencoder (GANprintR). 

e A thorough experimental assessment of this type of facial manipulation consider- 
ing fake detection (based on holistic deep networks, steganalysis, and local arti- 
facts) and realistic GAN-generated fakes (with and without the proposed GAN- 
printR) over different experimental conditions, i.e. controlled and in-the-wild sce- 
narios. 

e A recent database named iFakeFaceDB,” resulting from the application of the 
GANprintR approach to already very realistic synthetic images. 


2 https://github.com/socialabubi/iFakeFaceDB. 
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The remainder of the chapter is organised as follows. Section8.2 summarises 
the state of the art on the exploitation of GAN fingerprints for the detection of 
entire face synthesis manipulation. Section 8.3 explains the GAN-fingerprint removal 
approach (GANprintR) presented in Neves et al. (2020). Section 8.4 summarises the 
key features of the real and fake databases considered in the experimental assessment 
of this type of facial manipulation. Sections 8.5 and 8.6 describe the experimental 
setup and results achieved, respectively. Finally, Sect. 8.7 draws the final conclusions 
and points out some lines for future work. 


8.2 Related Work 


Contrary to popular belief, image manipulation dates back to the dawn of photog- 
raphy. Nevertheless, image manipulation only became particularly important after 
the rise of digital photography, due to the use of image processing techniques or 
low-cost image editing software. As a consequence, in the last decades the research 
community devised several strategies for assuring authenticity of digital data. In 
addition, digital image tampering still required some level of expertise to deceive the 
humans’ eye, and both factors helped reducing significantly the use of manipulated 
content for malicious purposes. However, after the proposal of Generative Adversar- 
ial Networks (Goodfellow et al. 2014), the possibility of synthesising realistic digital 
content became possible. Among the four possible levels of face manipulation, this 
chapter focuses on the entire face synthesis manipulation, particularly on the problem 
of distinguishing between real and fake facial images. 

Typically, synthetic face detection methods rely on the “fingerprints” caused by 
the generation process. According to the type of fingerprints used, each approach 
can be broadly divided into three categories: (i) methods based on visual artifacts; 
(ii) methods based on frequency analysis; and (iii) learning-based approaches for 
automatic fingerprint estimation. Table 8.1 provides a comparison of the state-of- 
the-art synthetic face detection methods. 

The following sections describe the state-of-the-art techniques for synthetic data 
generation and review the state-of-the-art methods capable of detecting synthetic 
face imagery according to the taxonomy described above. 


8.2.1 Generative Adversarial Networks 


Proposed by Goodfellow et al. (2014), GANs are a novel generative concept, com- 
posed of two neural networks contesting each other in the form of a competition. A 
generator learns to generate instances that resemble the training data, while a dis- 
criminator learns to distinguish between the real and the generated images, while 
serving the goal of penalising the generator. The goal is to have a generator that 
can learn how to generate plausible images that can fool the discriminator. While 
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at the beginning, GANs were only capable of producing low-resolution images of 
faces with some notorious visual artifacts, in the last years several techniques have 
emerged for synthesising highly realistic content (including BigGAN Brock et al. 
2019, CycleGAN Zhu et al. 2017, GauGAN Park et al. 2019, ProGAN Karras et al. 
2018, StarGAN Choi et al. 2018, StyleGAN Karras et al. 2019, and StyleGAN2 
Karras et al. 2020) that even humans cannot distinguish from the real ones. Next, we 
review the state-of-the-art approaches specifically devised for detecting a entire face 
synthesis manipulation. 


8.2.2 GAN Detection Techniques 


As denoted before, the images generated by the initial versions of GANs exhibited 
several visual artifacts, including distinct eye colour, holes in the face, deformed 
teeth, among others. For this reason, several approaches attempted to leverage these 
traits for detecting face manipulations (Matern et al. 2019; Yang et al. 2019; Hu et al. 
2020). Matern et al. (2019) extracted several geometric facial features which were 
then fed to a Support Vector Machine (SVM) classifier to distinguish between real 
and synthetic face images. Yang et al. (2019) exploited the weakness of GANs in 
generating consistent head poses and trained a SVM to distinguish between real and 
synthetic faces based on the estimation of the 3D head pose. As the remaining artifacts 
became less noticeable, researchers focused on more subtle features of the face, as 
in Hu et al. (2020), where synthetic face detection was performed by analysing the 
difference between the two corneal specular highlights. Other visual artifact typically 
exploited is the probability distribution of colour channels. McCloskey and Albright 
(McCloskey and Albright 2018) hypothesised that the colour is markedly different 
between real camera images and fake synthesis images, and proposed a detection 
system based on the colour histogram and a linear SVM. He et al. (2019) exploited 
different colour channels (YCbCr, HSV and Lab) to extract from a CNN different 
deep representations, which were subsequently fed to a Random Forest classifier 
for distinguishing between real and synthetic data. Li et al. (2020) observed that it 
is easier to spot the differences between real and GAN-generated data in non-RGB 
colour spaces, since GANs are trained for producing content in RGB channels. 

As the quality and realism of synthetic data improved, visual artifacts started to 
become ineffectual, which in turn fostered researchers to explore digital forensic 
techniques for the problem of synthetic data detection. Each camera sensor leaves 
a unique and stable mark on each acquired photo, denoted as the photo-response 
non-uniformity (PRNU) pattern (Lukas et al. 2006). This mark is usually denoted as 
the camera fingerprint, which inspired researchers to detect the presence of similar 
patterns in images synthesised by GANs. These approaches usually define the GAN 
fingerprint as a high-frequency signal available in the image. Marra et al. (2019a) 
defined GAN fingerprint as the high-level image information obtained by subtracting 
the image from its corresponding denoised version. Yu et al. (2018) improved (Marra 
et al. 2019a) by subtracting from the original image the corresponding reconstructed 
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version obtained from an autoencoder, which was tuned based on the discriminability 
of the fingerprints inferred by this process. They learned a model fingerprint for each 
source (each GAN instance plus the real world), such that the correlation index 
between one image fingerprint and each model fingerprint gives the probability of 
the image being produced by a specific model. Their proposed approach was tested 
using real faces from CelebA database (Liu et al. 2015) and synthetic faces created 
through different GAN approaches (PGGAN Karras et al. 2018, SNGAN Miyato 
et al. 2018, CramerGAN Bellemare et al. 2017, and MMDGAN Binkowski et al. 
2018), achieving a final accuracy of 99.50% for the best performance. Later, they 
extended their approach (Yu et al. 2020b) by proposing a novel strategy for the 
training of the generative model such that the fingerprints can be controlled by the 
user, and easily decoded from a synthetic image, allowing to solve the problem of 
source attribution, i.e. identifying the model that generated the image. In (Albright 
and McCloskey 2019), the authors proposed an alternative to (Yu et al. 2018) by 
replacing the autoencoder by an inverted GAN capable of reconstructing an image 
based on the attributes inferred from the original image. Zhang et al. (2019) proposed 
the use of the up-sampling artifact in the frequency domain as a discriminative feature 
for distinguishing veridical and synthetic data. Frank et al. (2020) reported similar 
conclusions regarding the discriminability of the frequency space of GAN-generated 
images. They relied on the Discrete Cosine Transform (DCT) for extracting features 
from either real and fake images, in order to train a linear classifier. Durall et al. (2020) 
found out that upconvolution or transposed convolution layers of GAN architectures 
are not capable of reproducing the spectral distribution of natural images. Based 
on this finding, they showed that generated face images can be easily identified 
by training a SVM with the features extracted with the Discrete Fourier Transform 
(DFT). Guarnera et al. (2020) used pixel correlation as a GAN fingerprint, since they 
noticed that the correlation of pixels in synthetic images are exclusively dependent 
on the operations performed by all the layers present in the GAN which generate it. 
Their proposed approach was tested using fake images generated by several GAN 
architectures (AttGAN, GDWCT, StarGAN, StyleGAN and StyleGAN2). 

A distinct family of methods adopts a data-driven strategy for the problem of 
detecting GAN-generated imagery. In this strategy, a standard image classifier, typ- 
ically a Convolutional Neural Network (CNN), is trained directly with raw images 
or through a modified version of them (Barni et al. 2020; Hsu et al. 2020). Marra 
et al. (2018) carried out a study about the classification accuracy of different CNN 
architectures when fed with raw images. It was observed that, in spite almost ideal per- 
formance was obtained, the performance decreased significantly when compressed 
images were used in the test set. Later, the authors proposed a strategy based on 
incremental learning for addressing this problem and the generalisation to unseen 
datasets (Marra et al. 2019c). Inspired by the forensic analysis of image manipulation 
(Cozzolino et al. 2014), Nataraj et al. (2019a) proposed a detection system based on 
a combination of pixel co-occurrence matrices and CNNs. Their proposed approach 
was initially tested in a database of various objects and scenes created through Cycle- 
GAN (Zhu et al. 2017). Besides, the authors performed an interesting analysis to see 
the robustness of the proposed approach against fake images created through differ- 
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ent GAN architectures (CycleGAN vs. StarGAN), with good generalisation results. 
This idea was later improved in (Goebel et al. 2020) and (Barni et al. 2020). 

The above studies show that a simple CNN is able to easily distinguish between 
real and synthetic data generated from specific GAN architectures, but is not capable 
of maintaining the same performance in data originated from GAN architectures not 
seen during training or even in data altered by image filtering operations. For this 
reason, Xuan et al. (2019) used an image pre-processing step in the training stage 
to remove artifacts of a specific GAN architecture. The same idea was exploited 
in (Hulzebosch et al. 2020) to improve the accuracy in real-world scenarios, where 
the particularities of the data (e.g. image compression) and the generator architecture 
are not known. Liu et al. (2020) observed that the texture of fake faces is substantially 
different from the real ones. Based on this observation, the authors devised a novel 
block to be added to the backbone of a CNN, the Gram-Block, which is capable of 
extracting global image texture features and improve the generalisation of the model 
against data generated by GAN architectures not used during training. Similarly, 
Yu et al. (2020a) introduced a novel convolution operator intended for separately 
processing the low- and high-frequency information of the image, improving the 
capability to detect the patterns of synthetic data available in the high-frequency 
band of the images. Finally, Wang et al. (2020a) studied the topic of generalisation to 
unseen datasets. For this, they collected a dataset consisting of fake images generated 
by 11 different CNN-based image generator models and concluded that the correct 
combination of pre-processing and data augmentation techniques allows a standard 
image classifier to generalise to unseen dataset even when trained with data obtained 
from a single GAN architecture. 

To summarise this section, we conclude that state-of-the-art automatic detection 
systems against face synthesis manipulation have excellent performance, mostly 
because they are able to learn the GAN fingerprints present in the images. However, 
it is also clear that the dependence on the model fingerprint affects the generability 
and the reliability of the model, e.g. when presented with adversarial attacks (Gandhi 
and Jain 2020). 


8.3 GAN Fingerprint Removal: GANprintR 


GANprintR was originally presented in (Neves et al. 2020) and aims at transform- 
ing synthetic face images, such that their visual appearance is unaltered but the 
GAN fingerprints (the discriminative information that permits the distinction from 
real imagery) are removed. Considering that the fingerprints are high-frequency sig- 
nals (Marra et al. 2019a), we hypothesised that their removal could be performed by 
an autoencoder, which acts as a non-linear low-pass filter. We claimed that by using 
this strategy, the detection capability of state-of-the-art facial manipulation detection 
methods significantly decreases, while at the same time humans still are not capable 
of perceiving that images were transformed. 


° 
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Training Data: 
Real Face Images 


Development Phase 


112x112x32 


æ 


fr Face Image after GANprintR 


Synthetic Synthetic Face Image 


Evaluation Phase 


Fig. 8.2 GAN-fingerprint Removal module (GANprintR) based on a convolutional AutoEn- 
coder (AE). The AE is trained using only real face images from the development dataset. In the 
evaluation stage, once the autoencoder is trained, we can pass synthetic face images through it to 
provide them with additional naturalness, in this way removing the GAN-fingerprint information 
that may be present in the initial fakes 


In general, an autoencoder comprises two distinct networks, encoder w and 
decoder y: 


y: X Mi: (8.1) 

yilRX, 
where X denotes the input image to the network, / is the latent feature representation 
of the input image after passing through the encoder y, and X’ is the reconstructed 
image generated from /, after passing through the decoder y. The networks y and 
y can be learned by minimising the reconstruction loss Ly ,(X, X’) = ||X — X'| |2 
over a development dataset following an iterative learning strategy. 

As result, when £ is nearly 0, w is able to discard all redundant information 
from X and code it properly into /. However, for a reduced size of the latent feature 
representation vector, £ will increase and y will be forced to encode in / only the 
most representative information of X. We claimed that this kind of autoencoder acts 
as a GAN-fingerprint removal system. 

Figure 8.2 describes the GANprintR architecture based on a convolutional AutoEn- 
coder (AE) composed of a sequence of 3 x3 convolutional filters, coupled with RELU 
activation functions. After each convolutional layer, a2 x 2 max-pooling layer is used 
to progressively decrease the size of the activation map to 28x28 x8, which repre- 
sents the bottleneck of the reconstruction model. 

The AE is trained with images from a public dataset that comprises face imagery 
from real persons. In the evaluation phase, the AE is used to generate improved fakes 
from input fake faces where GAN “fingerprints”, if present in the initial fakes, will 
be reduced. The main rationale of this strategy is that by training with real images 
the AE can learn the core structure of this type of natural data, which can then be 
exploited to improve existing fakes. 
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CASIA-WebFace (Real) 


TPDNE (Synthetic) 


Fig. 8.3 Examples of the databases considered in the experiments of this chapter after applying 
the pre-processing stage described in Sect. 8.5.1 
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8.4 Databases 


Four different public databases and one generated are considered in the experimental 
framework of this chapter. Figure 8.3 shows some examples of each database. We 
now summarise the most important features. 


8.4.1 Real Face Images 


e CASIA-WebFace: this database contains 494,414 face images from 10,575 actors 
and actresses of IMDb. Face images comprise random pose variations, illumina- 
tion, facial expression and resolution. 

e VGGFace2: this database contains 3,31 million images from 9,131 different sub- 
jects, with an average of 363 images per subject. Images were downloaded from 
the Internet and contain large variations in pose, age, illumination, ethnicity and 
profession (e.g. actors, athletes, and politicians). 


8.4.2 Synthetic Face Images 


TPDNE: this database comprises 150,000 unique faces, collected from the web- 
site.> Synthetic images are based on the recent StyleGAN approach (Karras et al. 
2019) trained with FFHQ database (Flickr-Faces-HQ 2019). 

e 100K-Faces: this database contains 100,000 synthetic images generated using 
StyleGAN (Karras et al. 2019). In this database the StyleGAN network was trained 
using around 29,000 photos of 69 different models, producing face images with a 
flat background. 

PGGAN: this database comprises 80,000 synthetic face images generated using the 
PGGAN network. In particular, we consider the publicly available model trained 
using the CelebA-HQ database. 


8.5 Experimental Setup 


This section describes the details of the experimental setup followed in the experi- 
mental framework of this chapter. 


3 https://thispersondoesnotexist.com. 
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8.5.1 Pre-processing 


In order to ensure fairness in our experimental validation, we created a curated version 
of all the datasets where the confounding variables were removed. Two different 
factors were considered in this chapter: 


Background: this is a clearly distinctive aspect among real and synthetic face 
images as different acquisition conditions are considered in each database. 

Head pose: images generated by GANs hardly ever produce high variation from 
the frontal pose (Dang et al. 2020), contrasting with most popular real face 
databases such as CASIA-WebFace and VGGFace2. Therefore, this factor may 
falsely improve the performance of the detection systems since non-frontal images 
are more likely to be real faces. 


To remove these factors from both the real and synthetic images, we extracted 68 
face landmarks, using the method described in (Kazemi and Sullivan 2014). Given 
the landmarks of the eyes, an affine transformation was determined such that the 
location of the eyes appears in all images at the same distance from the borders. This 
step allowed to remove all the background information of the images while keeping 
the maximum amount of the facial regions. Regarding the head pose, landmarks were 
used to estimate the pose (frontal vs. non-frontal). In the experimental framework of 
this chapter, we kept only the frontal face images, in order to avoid biased results. 
After this pre-processing stage, we were able to provide images of constant size 
(224x224 pixels) as input to the systems. Figure 8.3 shows examples of the crop- 
out faces of each database after applying the pre-processing steps. The synthetic 
images obtained by this pre-processing stage are the ones used to create the database 
iFakeFaceDB after being processed by the GANprintR approach. 


8.5.2 Facial Manipulation Detection Systems 


Three different state-of-the-art manipulation detection approaches are considered in 
this chapter. 

(1) XceptionNet (Chollet 2017): this network was selected, essentially because it 
provides the best detection results in the most recently published studies (Dang et al. 
2020; Rössler et al. 2019; Dolhansky et al. 2019). We followed the same training 
approach considered in (Rössler et al. 2019): (i) the model was initialised with the 
weights obtained after training with the ImageNet dataset (Deng et al. 2009), (ii) we 
changed the last fully-connected layer of the ImageNet model by a new one (two 
classes, real or synthetic image), (iii) we fixed all weights up to the final layers and 
pre-trained the network for few epochs, and finally (iv) we trained the network for 
20 more epochs and chose the best performing model based on validation accuracy. 

(2) Steganalysis (Nataraj et al. 2019b): the method by Nataraj et al. was selected 
for providing an approach based on steganalysis, rather than directly extracting fea- 
tures from the images, as in the XceptionNet approach. In particular, this approach 


8 GAN Fingerprints in Face Image Synthesis 189 


calculates the co-occurrence matrices directly from the image pixels on each chan- 
nel (red, green and blue), and passes this information through a custom CNN, which 
allows the network to extract non-linear robust features. Considering that the source 
code is not available from the authors, we replicated this technique to perform our 
experiments. 

(3) Local Artifacts (Matern et al. 2019): we have chosen the method of Matern et 
al., because it provides an approach based on the direct analysis of the visual facial 
artifacts, in opposition to the remaining approaches that follow holistic strategies. In 
particular, the authors of that work claim that some parts of the face (e.g. eyes, teeth, 
facial contours) provide useful information about the authenticity of the image, and 
thus train a classifier to distinguish between real and synthetic face images using 
features extracted from these facial regions. 

All our experiments were implemented under a PyTorch framework, with a 
NVIDIA Titan X GPU. The training of the Xception network was performed using 
the Adam optimiser with a learning rate of 107°, dropout for model regularisation 
with a rate of 0.5, and a binary cross-entropy loss function. Regarding the steganal- 
ysis approach, we reused the parameters adopted for Xception network, since the 
authors of (Nataraj et al. 2019b) did not detail the training strategy adopted. Regard- 
ing the local artifacts approach, we adopted the strategy for detecting “generated 
faces”, where a k-nearest neighbour classifier was used to distinguish between real 
and synthetic face images based on eye colour features. 


8.5.3 Protocol 


The experimental protocol designed in this chapter aims at performing an exhaus- 
tive analysis of the state-of-the-art facial manipulation detection systems. As such, 
three different experiments were considered: (i) controlled scenarios, (ii) in-the-wild 
scenarios, and (iii) GAN-fingerprint removal. 

Each database was divided into two disjoint datasets, one for the development of 
the systems (70%) and the other one for evaluation purposes (30%). Additionally, 
the development dataset was divided into two disjoint subsets, training (75%) and 
validation (25%). The same number of real and synthetic images were considered 
in the experimental framework. In addition, for real face images, different users 
were considered in the development and evaluation datasets, in order to avoid biased 
results. 

The GANprintR approach was trained during 100 epochs, using the Adam opti- 
mizer with a learning rate of 107°, and a mean square error (MSE) to obtain the 
reconstruction loss. To ensure an unbiased evaluation, GANprintR was trained with 
images from the MS-Celeb dataset (Guo et al. 2016), since it is disjoint from the 
datasets used in the development and evaluation of all the fake detection systems 
used in our experiments. 
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8.6 Experimental Results 


This section describes the results achieved in the experimental framework of this 
chapter. 


8.6.1 Controlled Scenarios 


In this section, we report the results of the detection of entire face synthesis in 
controlled scenarios, i.e. when samples from the same databases were considered for 
both development and final evaluation of the detection systems. This is the strategy 
commonly used in most studies, typically resulting in very good performance (see 
Sect. 8.2). 

A total of six experiments were carried out: A.1 to A.6. Table 8.2 describes the 
development and evaluation databases considered in each experiment together with 
the corresponding final evaluation results in terms of EER. Additionally, we represent 
in Fig. 8.4 the evolution of the loss/accuracy of the XceptionNet and Steganalysis 
detection systems for Exp. A.1. 

The analysis of Fig. 8.4 shows that both XceptionNet and Steganalysis approaches 
were able to learn discriminative features to detect between real and synthetic face 
images. The training process was faster for the XceptionNet detection system com- 
pared with Steganalysis, converging to a lower loss value in fewer epochs (close 
to zero after 20 epochs). The best validation accuracy achieved in Exp. A.1 for the 
XceptionNet and Steganalysis approaches were 99% and 95%, respectively. Similar 
trends were observed for the other experiments. 

We now analyse the results included in Table8.2 for experiments A.1 to A.6. 
Analysing the results obtained by the XceptionNet system, almost ideal performance 
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Fig. 8.4 Exp. A.1: Evolution of the loss/accuracy with the number of epochs 
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is achieved with EER values less than 0.5%. These results are in agreement to previous 
studies in the topic (see Sect. 8.2), pointing for the potential of the XceptionNet model 
in controlled scenarios. Regarding the Steganalysis approach, a higher degradation of 
the system performance is observed, when compared with the XceptionNet approach, 
especially for the 100K-Face database, e.g. a 16% EER is obtained in Exp. A.5. 
Finally, it can be observed that the approach based on local artifacts was the least 
efficient to spot the differences between real and synthetic data, with an average 
35.5% EER over all experiments. 

In summary, for controlled scenarios XceptionNet has excellent manipulation 
detection accuracies, then Steganalysis provides good accuracies, and finally Local 
Artifacts have poor accuracy. In the next section we will see the limitations of these 
techniques in-the-wild. 


8.6.2 In-the-Wild Scenarios 


This section evaluates the performance of the facial manipulation detection systems 
in more realistic scenarios, i.e. in-the-wild. The following aspects are considered: 
(i) different development and evaluation databases, and (ii) different image reso- 
lution/blur among the development and evaluation of the models. This last point is 
particularly important, as the quality of raw images/videos is usually modified when, 
e.g. they are uploaded to social media. The effect of image resolution has been pre- 
liminary analysed in previous studies (Rossler et al. 2019; Korshunov and Marcel 
2018), but for different facial manipulation groups, i.e. face swapping/identity swap 
and facial expression manipulation. The main goal of this section is to analyse the 
generalisation capability of state-of-the-art entire face synthesis detection in uncon- 
strained scenarios. 

First, we focus on the scenario of considering the same real but different syn- 
thetic databases in development and evaluation (Exp. B.1, B.2, B.5, B.6, and so on, 
provided in Table 8.2). In general, the results achieved in the experiments evidence 
a high degradation of the detection performance regardless of the facial manipula- 
tion detection approach. For the XceptionNet, the average EER is 11.2%, i.e. over 
20 times higher than the results achieved in Exp. A.1—A.6 (<0.5% average EER). 
Regarding the Steganalysis approach, the average EER is 32.5%, i.e. more than 3 
times higher than the results achieved in Exp. A.1-A.6 (9.8% average EER). For 
Local Artifacts, the observed average EER was 42.4%, with an average worsening 
of 19%. The large degradation of the first two detectors suggests that they might 
rely heavily on the GAN fingerprints of the training data. This result confirms the 
hypothesis that different GAN models produce different fingerprints, as also men- 
tioned in previous studies (Yu et al. 2018). Moreover, these results suggest that these 
GAN fingerprints are the information used by the detectors to distinguish between 
real and synthetic data. 

Table 8.2 also considers the case of using different real and synthetic databases 
for both development and evaluation (Exp. B.3, B.4, B.7, B.8, etc.). In this scenario, 
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an average EERs of 9.3%, 32.3% and 42.3% in fake detection were obtained for 
XceptionNet, Steganalysis and Local Artifacts, respectively. When comparing these 
results with the EERs of the previous experiments (where only the synthetic evalua- 
tion set was changed), no significant gap in performance was found, which points that 
the change of synthetic data might be the main cause for performance degradation. 

Finally, we also analyse how different image transformations affect facial manip- 
ulation detection systems. In this analysis, we focus only on the XceptionNet model 
as it provides much better results when compared with the remaining detection sys- 
tems. For each baseline experiment (A.1 to A.6), the evaluation set (both real and 
fake images) was transformed by: (i) resolution downsizing (1/3 of the original res- 
olution), (ii) a low-pass filter (9 x 9 Gaussian kernel, o = 1.7), and (iii) jpeg image 
compression using a quality level of 60. The resulting EER together with the Recall, 
PSRN and SSIM values are provided in Table 8.3, together with the performance of 
the original images. The results suggest a high performance degradation in all exper- 
iments, proving the vulnerability of the fake detection system to unseen conditions, 
even if they result from simple image transformations. 

To further understand the impact of these transformations, we evaluated an 
increasing downsize ratio in the performance of the fake detection system. Figure 8.5 
depicts the detection performance results in terms of EER (%), from lower to higher 
modifications of the image resolution. In general, we can observe increasingly higher 
degradation of the fake detection performance for decreasing resolution. For exam- 
ple, when the image resolution is reduced by 1/4, the average EER increases 6% 
when compared with the raw image resolution (raw equals to 1/1). This performance 
degradation is even higher when we further reduce the image resolution, with EERs 
(%) higher than 15%. These results support the conclusion about a poor generali- 
sation capacity of state-of-the-art facial manipulation detection systems to unseen 
conditions. 


8.6.3 GAN-Fingerprint Removal 


This section analyses the results of the strategy for GAN-fingerprint Removal (GAN- 
printR). We evaluated to what extent our method is capable of spoofing state-of-the- 
art facial manipulation detection systems by improving fake images already obtained 
with some of the best and most realistic known methods for entire face synthesis. 
For this, the experiments A.1 to A.6 were repeated for the XceptionNet detection 
system, but the fake images of the evaluation set were transformed after passing 
through GANprintR. 

Table 8.3 provides the results achieved for both the original fake data and after 
GANprintR. The analysis of the results shows that GANprintR obtains higher fake 
detection error than the remaining attacks, while maintaining a similar or even better 
visual quality. In all the experiments, the EER of the manipulation detection increases 
when using GANprintR to transform the synthetic face images. Also, the detection 
degradation is higher than other types of attacks for similar PSNR values and slightly 
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Table 8.3 Comparison between the GANprintR approach and typical image manipulations. 
The detection performance is provided in terms of EER (%) for experiments A.1 to A.6, when using 
different versions of the evaluation set. TDE stands for transformation of the evaluation data and 
details the technique used to modify the test set before fake detection. R,eq; and R fake denote the 


Recall of the real and fake classes, respectively, 


Experiment | TDE EER (%) | Rreal (%) | XceptionNet 
R fake (%) | PSNR SSIM 
(db) 

Al Original 0.22 99.77 99.80 - 
Downsize 1.17 98.83 98.87 35.55 
Low-pass filter 0.83 99.17 99.20 34.63 
jpeg compression 1.53 98.47 98.50 36.02 
GANprintR 10.63 89.37 89.40 35.01 

A.2 Original 0.28 99.70 99.73 - 
Downsize 0.87 99.13 99.17 36.24 
Low-pass filter 2.87 97.10 97.13 35.22 
jpeg compression 1.83 98.17 98.20 36.76 
GANprintR 6.37 93.64 93.66 35.59 

A.3 Original 0.02 99.97 100.00 - 
Downsize 3.70 96.27 96.30 34.85 
Low-pass filter 1.53 98.43 98.47 34.10 
jpeg compression 30.93 69.04 69.06 35.85 
GANprintR 17.27 82.71 82.73 34.82 

A.4 Original 0.02 99.97 100.00 - 
Downsize 1.00 98.97 99.00 35.55 
Low-pass filter 0.07 99.90 99.93 34.63 
jpeg compression 2.50 97.47 97.50 36.02 
GANprintR 4.47 95.50 95.53 35.01 

AS5 Original 0.08 99.90 99.93 - 
Downsize 6.27 93.70 93.73 36.24 
Low-pass filter 11.53 88.44 88.46 35.22 
jpeg compression 3.27 96.73 96.77 36.76 
GANprintR 11.47 88.50 88.53 35.59 

A.6 Original 0.05 99.93 99.97 - - 
Downsize 7.77 92.24 92.26 34.85 0.91 
Low-pass filter 2.10 97.90 97.93 34.10 0.90 
jpeg compression 5.37 94.64 94.66 35.85 0.96 
GANprintR 8.37 91.64 91.66 34.82 0.95 
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Fig.8.5 Robustness of the fake detection system regarding the image resolution. The Xcep- 
tionNet model is trained with the raw image resolution and evaluated with lower image resolutions. 
Note how the EER increases significantly while reducing the image resolution 


higher values of SSIM. In particular, the average EER when considering GANprintR 
is 9.8%, i.e. over 20 times higher than the results achieved when using the original 
fakes (<0.5% average EER). This suggests that our method is not simply removing 
high-frequency information (evidenced by the comparison with the low-pass filter 
and downsize) but it is also removing the GAN fingerprints from the fakes improving 
their naturalness. It is important to remark that different real face databases were 
considered for training the face manipulation detectors and our GANprintR module. 

In addition, we provide in Fig. 8.6 an analysis of the impact of the latent feature 
representation of the autoencoder in terms of EER and PSNR. In particular, we follow 
the experimental protocol considered in Exp. A.3, and calculate the EER of Xcep- 
tionNet for detecting fakes improved with various configurations of GANprintR. 
Moreover, the PSNR for each set of transformed images is also included in Fig. 8.6 
together with a face example of each configuration to visualise the image quality. 
The face examples included in Fig. 8.6 show no substantial differences between the 
original fake and the resulting fakes after GANprintR for the different latent fea- 
ture representation size of the GANprintR, which is confirmed by the tight range of 
PSNR values obtained along the different latent feature representations. The EER 
values of fake detection significantly increase as the size of latent feature represen- 
tations diminish, evidencing that GANprintR is capable of spoofing state-of-the-art 
detectors without significantly degrading the visual aspect of the image. 
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Fig. 8.6 Robustness of the fake detection system after GAN-fingerprint Removal (GAN- 
printR). The latent feature representation size of the AE is varied to analyse the impact on both 
system performance and visual aspect of the reconstructed images. Note how the EER increases sig- 
nificantly when considering GANprintR spoof approach, while maintaining a high visual similarity 
with the original image 


Finally, to confirm that GANprintR is actually removing the GAN-fingerprint 
information and not just reducing the image resolution of the images, we performed 
a final experiment where we trained the XceptionNet for fake detection considering 
different levels of image resolution, and then tested it using fakes improved with 
GANprintR. Figure 8.7 shows the fake detection performance in terms of EER for 
different sizes of the latent feature representation of GANprintR. Five different GAN- 
printR configurations are tested per image resolution. The obtained results point for 
the stability of EER values with respect to downsized synthetic images in training, 
concluding that GANprintR is actually removing the GAN-fingerprint information. 


8.6.4 Impact of GANprintR on Other Fake Detectors 


For completeness, we provide in this section a comparative analysis between the 
impact of the GANprintR approach on the three state-of-the-art manipulation detec- 
tion approaches considered in this chapter. Table 8.4 reports the EER and Recall 
observed when using the original images and when using the modified version of the 
same images. 

In Sect. 8.6.1 it has been concluded that XceptionNet stands out as the most reliable 
approach at recognising synthetic faces. The analysis of Table 8.4 evidences that this 
conclusion also holds when using images transformed by GANprintR. Nevertheless, 
itis also interesting to analyse the performance degradation caused by the GANprintR 
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Fig. 8.7 Robustness of the fake detection system trained with different resolutions and then 
tested with fakes improved with GANprintR under various configurations (representation 
sizes). Five different GANprintR configurations are tested per image resolution level. The results 
observed point for the stability of EER values with respect to using downsized synthetic images in 
training. This observation supports the conclusion that GANprintR is actually removing the GAN 
fingerprints. 


approach. The average number of percentage points that the EER has increased for 
XceptionNet, Steganalysis and Local Artifacts is 9.65, 14.68 and 4.91, respectively. 
Even though, in this case, the work of Matern et al. (2019) stands out for having 
the lowest performance degradation, we believe that this is primarily due to the high 
EER achieved in the original set of images. 


8.7 Conclusions and Outlook 


This chapter has covered the topic of GAN fingerprints in face image synthesis. We 
have first provided an in-depth literature analysis of the most popular GAN synthesis 
architectures and fake detection techniques, highlighting the good fake detection 
results achieved by most approaches due to the “fingerprints” inserted in the GAN 
generation process. 

In addition, we have reviewed a recent approach to improve the naturalness 
of facial fake images and spoof state-of-the-art fake detectors: GAN-fingerprint 
Removal (GANprintR). GANprintR was originally presented in Neves et al. (2020) 
and is based on a convolutional autoencoder. The autoencoder is trained using only 
real face images from the development dataset. In the evaluation stage, once the 
autoencoder is trained, we can pass synthetic face images through it to provide them 
with additional naturalness, in this way removing the GAN-fingerprint information 
that may be present in the initial fakes. 

A thorough experimental assessment of this type of facial manipulation has been 
carried out considering fake detection (based on holistic deep networks, steganalysis, 
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and local artifacts) and realistic GAN-generated fakes (with and without GANprintR) 
over different experimental conditions, i.e. controlled and in-the-wild scenarios. We 
highlight three major conclusions about the performance of the state-of-the-art fake 
detection methods: (i) the existing fake systems attain almost perfect performance 
when the evaluation data is derived from the same source used in the training phase, 
which suggests that these systems have actually learned the GAN “fingerprints” from 
the training fakes generated with GANs; (ii) the observed fake detection performance 
decreases substantially (over one order of magnitude) when the fake detection is 
exposed to data from unseen databases, and over seven times in case of substantially 
reduced image resolution; and (iii) the accuracy of the existing fake detection methods 
also drops significantly when analysing synthetic data manipulated by GANprintR. 

In summary, our experiments suggest that the existing facial fake detection meth- 
ods still have a poor generalisation capability and are highly susceptible to—even 
simple—image transformation manipulations, such as downsizing, image compres- 
sion or others similar to the one proposed in this work. While loss of resolution 
may not be particularly concerning in terms of the potential misuse of the data, it 
is important to note that approaches such as GANprintR are capable of confound- 
ing detection methods, while maintaining a high visual similarity with the original 
image. 

Having shown some of the limitations of the state-of-the-art in face manipulation 
detection, future work should research about strategies to harden such face manipu- 
lation detectors by exploiting databases such as iFakeFaceDBiFakeFaceDB.* Addi- 
tionally, further works should study: (i) how improved fakes obtained in similar ways 
as GANprintR can jeopardise other kinds of sensitive data (e.g. other popular bio- 
metrics like fingerprint (Tolosana et al. 2020a), iris (Proença and Neves 2019), or 
behavioural traits (Tolosana et al. 2020b)), (ii) how to improve the security of systems 
dealing with other kinds of sensitive data (Hernandez-Ortega et al. 2021), and finally 
(iii) best ways to combine multiple manipulation detectors (Tolosana et al. 2021) in 
a proper way (Fiérrez et al. 2018) to deal with the growing sophistication of fakes. 
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Physics-based methods anchor the forensic analysis in the physical laws of image 
and video formation. The analysis is typically based on simplifying assumptions to 
make the forensic analysis tractable. In scenes that satisfy such assumptions, different 
types of forensic analysis can be performed. The two most widely used applications 
are the detection of content repurposing and content splicing. Physics-based methods 
expose such cases with assumptions about the interaction of light and objects, and 
about the geometric mapping of light and objects onto the image sensor. 

In this chapter, we review the major lines of research on physics-based methods. 
The approaches are categorized as geometric and photometric, and combinations of 
both. We also discuss the strengths and limitations of these methods, including an 
interesting unique property: most physics-based methods are quite robust to low- 
quality material, and can even be applied to analog photographs. The chapter closes 
with an outlook and with links to related forensic techniques such as the analysis of 
physiological signals. 


9.1 Introduction 


Consider a story that is on the internet. Let us also assume that an important part of 
that story is a picture or video to document the story. If this story goes viral, it becomes 
potentially relevant also to classical journalists. Relevance may either inherently be 
given, because the story reports a controversial event of public interest, or relevance 
may emerge from the sole fact that a significant number of people are discussing a 
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particular topic. Additionally, quality journalism also requires confidence in the truth 
value that is associated with a news story. Hence, classical journalistic work involves 
the work of eye witnesses and other trusted sources of documentation. However, 
stories that emerge from internet sources are oftentimes considerably more difficult 
to verify for a number of reasons. One reason may be because the origin of the 
story is difficult to determine. Another reason may be that reports are difficult to 
verify because involved locations and people are inaccessible, as it is oftentimes 
the case for reports from civil war zones. In these cases, a new specialization of 
journalistic research formed in the past years, namely journalistic fact checking. The 
techniques of fact checkers will be introduced in Sect.9.1.1. We will note that the 
class of physics-based methods in multimedia forensics is closely related to these 
techniques. More precisely, physics-based approaches can be seen as computational 
support tools that smoothly integrate into the journalistic fact checking toolbox, as 
highlighted in Sect.9.1.2. We close this section with an outline for the remaining 
chapter in Sect. 9.1.3. 


9.1.1 Journalistic Fact Checking 


Classical journalistic research about a potential news story aims to answer the so- 
called “Five W” questions who, what, why, when, and where about an event, i.e., 
who participated in the event, what happened, why, and at what time and place it 
happened. The answers to these questions are typically derived and verified from 
contextual information, witnesses, and supporting documents. Investigative journal- 
ists foster a broad network of contacts and sources to verify such events. However, 
for stories that emerge on the internet, this classical approach can be of limited effec- 
tivity, particularly if the sources are anonymized or covered by social media filters. 
Journalistic fact checkers still heavily rely on classical techniques, and, for exam- 
ple, investigate the surroundings in social networks to learn about possible political 
alignments and actors from where a message may have originated. For images or 
videos, context is also helpful. 

One common issue with such multimedia content is repurposing. This means that 
the content itself is authentic, but it has been acquired at a different time or a different 
place than what is claimed in the associated story. To find cases of repurposing, the 
first step is a reverse image search as it is possible with specialized search engines 
like tineye.comor images.google.com. 

Furthermore, a number of open-source intelligence tools are available for a further 
in-depth verification of the image content. Depending on the shown scene and the 
context of the news story, photogrammetric methods allow to validate time and place 
from the position of the sun and known or estimated shadow lengths of buildings, 
vehicles, or other known objects. Any landmarks in the scene, written text, or signs 
can be further used to constrain the possible location of acquisition, combined with, 
e.g., regionally annotated satellite maps. Another type of open-source information 
is monitoring tools for flights and shipping lines. Such investigations require a high 
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manual work effort to collect and organize different pieces of evidence and to assess 
their individual credibility. In Journalistic practice, however, these approaches are 
currently the gold standard to verify multimedia content from sources with very 
limited contextual information. Unfortunately, this manual effort also implies that an 
investigation may take several days or weeks. While it would oftentimes be desirable 
to quell wrong viral stories early on, this is in many cases not possible with current 
investigative tools and techniques. 


9.1.2 Physics-Based Methods in Multimedia Forensics 


In multimedia forensics, the group of physics-based methods follows the same spirit 
as journalistic fact checking. Like the journalistic methods, they operate on the con- 
tent of the scene. They also use properties of known objects, and some amount of 
outside knowledge, such as the presence of a single dominant light source. However, 
there are two main differences. First, physics-based methods in multimedia foren- 
sics operate almost exclusively on the scene content. They use much less contextual 
knowledge, since this contextual information is strongly dependent on the actual 
case, which makes it difficult to address in general-purpose algorithms. Second, 
journalistic methods focus on validating time, place, and actors, while physics-based 
methods in multimedia forensics aim to verify that the image is authentic, i.e., that 
the scene content is consistent with the laws of physics. 

The second difference makes physics-based forensic algorithms complementary 
to journalistic methods. They can add specific evidence whether an image is a com- 
posite from multiple sources. For example, if an image shows two persons on a free 
field who are only illuminated by the sun, then one can expect that both people are 
illuminated from the same direction. More subtle cues can be derived from the laws 
of perspective projection. This applies to all common acquisition devices, since it is 
the foundation of imaging with a pinhole camera. The way how light is geometri- 
cally mapped onto the sensor is tied to the camera and to the location of objects in 
the scene. In several cases, the parameters for this mapping can be calculated from 
specific objects. Two objects in an image with inconsistent parameters can indicate 
an image composition. Similarly, it is possible to compare the color of incident light 
on different objects, or the color of ambient light in shadow areas. 

To achieve this type of analysis, physics-based algorithms use methods from 
related research fields, most notably from computer vision and photometry. Typi- 
cally, the analysis analytically solves an assumed physical model for a quantity of 
interest. This has several important consequences, which distinguish physics-based 
approaches from statistical approaches in multimedia forensics. First, physics-based 
methods typically require an analyst to validate the model assumptions, and to per- 
form a limited amount of annotations in the scene to access the known variables. In 
contrast, statistical approaches can in most cases work fully automatically, and are 
hence much better suitable for batch processing. Second, the applied physical models 
and analysis methods can be explicitly checked for their approximation error. This 
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makes physics-based methods inherently explainable, which is an excellent property 
for defending a decision based on a physics-based analysis. The majority of physics- 
based approaches do not use advanced machine learning methods, since this would 
make the explainability of the results much more complicated. Third, physics-based 
algorithms require specific scene constellations in order to be applicable. For exam- 
ple, an analysis that assumes that the sun is the only light source in the scene is not 
applicable to indoor photographs or night time pictures. Conversely, the focus on 
scene content in conjunction with manual annotations by analysts makes physics- 
based methods by design extremely robust to the quality or processing history of an 
image. This is a clear benefit over statistical forensic algorithms, since their perfor- 
mance quickly deteriorates on images of reduced quality or complex post-processing. 
Some physics-based methods can even be applied to printouts, which is also not pos- 
sible for statistical methods. 

With these three properties, physics-based forensic algorithms can be seen as 
complementary to statistical approaches. Their applicability is limited to specific 
scenes and manual interactions, but they are inherently explainable and very robust 
to the processing history of an image. 

Closely related to physics-based methods are behavioral cues of persons and 
physiological features as discussed in Chap. 11 of this book. These methods are not 
physics-based in the strict sense, as they do not use physical models. However, these 
methods share with physics-based methods the property that they operate on the 
scene content, and offer as such in many cases also resilience to various processing 
histories. 


9.1.3 Outline of This Chapter 


A defining property of physics-based methods is the underlying analytic models and 
their assumptions. The most widely used models will be introduced in Sect. 9.2. The 
models are divided into two parts: geometric and optical models are introduced in 
Sect.9.2.1, and photometric and reflectance models are introduced in Sect.9.2.2. 
Applications of these models in forensic algorithms are presented in Sect. 9.3. This 
part is subdivided into geometric methods, photometric methods, and combinations 
thereof. This chapter concludes with a discussion on the strength and weaknesses of 
physics-based methods and an outlook on emerging challenges in Sect. 9.4. 


9.2 Physics-Based Models for Forensic Analysis 


Consider a photograph of a building, for example, the Hagia Sophia in Fig. 9.1. The 
way how this physical object is converted to a digital image is called image formation. 
Three aspects of the image formation are particularly relevant for physics-based 
algorithms. These three aspects are illustrated below. 
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Fig. 9.1 Example photograph “Turkey-3019—Hagia Sophia” (full picture credits at the end of 
chapter) 


First, landmark points like the tips of the towers are projected onto the camera. 
For all conventional cameras that operate with a lens, the laws of perspective projec- 
tion determine the locations to which these landmarks are projected onto the sensor. 
Second, the dome of the building is bright where the sun is reflected into the cam- 
era, and darker at other locations. Brightness and color variations across pixels are 
represented with photometric models. Third, the vivid colors and, overall, the final 
representation of the image are obtained from the several camera-internal processing 
functions, which computationally perform linear and non-linear operations on the 
image. 

The first two components of the image formation are typically used for a physics- 
based analysis. We introduce the foundation for the analysis of geometric properties 
of the scene in Sect. 9.2.1, and we introduce foundations for the analysis of reflectance 
properties of the scene in Sect. 9.2.2. 


9.2.1 Geometry and Optics 


The use of geometric laws in image analysis is a classical topic from the field of 
computer vision. Geometric algorithms typically examine the relation between the 
location of 3-D points in the scene and their 2-D projection onto the image. In 
computer vision, the goal is usually to infer a consistent 3-D structure from one or 
more 2-D images. One example is the stereo vision to calculate depth maps for an 
object or scene from two laterally shifted input images. In multimedia forensics, the 
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Fig. 9.2 The mapping between world- and camera coordinates is performed with the extrinsic 
camera parameters (left). The mapping from the 3-D scene onto the 2-D image plane is performed 
with the intrinsic camera parameters (right) 


underlying assumption of a geometric analysis is that the geometric consistency of 
the scene is violated by an inserted object or by certain types of object editing. 

For brevity of notation, let us denote a point in the image I at coordinate (y, x) as 
x = I (y, x). This point corresponds to a 3-D point in the world at the time when the 
image was taken. In the computer vision literature, this point is typically denoted as 
X, but here it is denoted as X to avoid notational confusion with matrices. Vectors that 
are used in projective geometry are typically written in homogeneous coordinates. 
This also allows to conveniently express points at infinity. Homogeneous vectors 
have one extra dimension with an entry z}, such that the usual Cartesian coordinates 
are obtained by dividing each other entry by z}. Assume that we convert Cartesian 
coordinates to homogeneous coordinates, and choose z} = 1. Then, the remaining 
vector elements x; are identical to their corresponding Cartesian coordinates, since 
Xj / l= Xi. 

When taking a picture, the camera maps the 3-D point X from the 3-D world onto 
the 2-D image point x. Mathematically, this is a perspective projection, which can 
be very generally expressed as 


= K[RIt]x , (9.1) 


where x and x are written in homogeneous coordinates. This projection consists of 
two matrices: the matrix [R|t] € IR?** contains the so-called extrinsic parameters. It 
transforms the point x from an arbitrary world coordinate system to the 3-D coordi- 
nates of the camera. The matrix K € R**? contains the so-called intrinsic parameters. 
It maps these 3-D coordinates onto the 2-D image plane of the camera. 

Both mapping steps are illustrated in Fig. 9.2. The mapping of the world coordinate 
system to the camera coordinate system via the extrinsic camera parameters is shown 
on the left. The mapping of the 3-D scene onto the 2-D image plane is shown on the 
right. 

Both matrices have a special form. The matrix of extrinsic parameters [R|t] is a 
3 x 4 matrix. It consists of a 3 x 3 rotation matrix in the first three columns and a 
3 x 1 translation vector in the fourth column. The matrix of intrinsic parameters K 
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is an upper triangular 3 x 3 matrix 


kK=|o o], (9.2) 


where f, and f, are the focal length in pixels along x- and y-direction, (Cx, cy) is 
the camera principal point, and s denotes pixel skewness. The camera principal point 
(oftentimes also called camera center) is particularly important for forensic applica- 
tions, as it marks the center of projection. The focal length is also of importance, as it 
changes the size of the projection cone from world coordinates to image coordinates. 
This is illustrated in the right part of Fig.9.2 with two example focal lengths: the 
smaller focal length f, maps the whole person onto the image plane, whereas the 
larger focal length f only maps the torso around the center of projection (green) 
onto the image plane. 

Forensic algorithms that build on top of these equations typically do not use the 
full projection model. If only a single image is available, and no further external 
knowledge on the scene geometry, then the world coordinate system can be aligned 
with the camera coordinate system. In this case, the x- and y-axes of the world 
coordinate system correspond to the x- and y-axes of the camera coordinate system, 
and the z-axis points with the camera direction into the scene. In this case, the matrix 
[R|t] can be omitted. Additionally, it is a common assumption to assume that the 
pixels are square, the lens is sufficiently homogeneous, and that the camera skew is 
negligible. These assumptions simplify the projection model to only three unknown 
intrinsic parameters, namely 


f 0 c, 
x=|ofecj|<x. (9.3) 
001 


These parameters are the focal length f and the center of projection (cx, cy), which 
are used in several forensic algorithms. 

One interesting property of the perspective projection is the so-called vanishing 
points: in the projected image, lines that are parallel in the 3-D world converge to 
a vanishing point on the image. Vanishing points are not necessarily visible within 
the shown scene but can also be outside of the scene. For example, a 3-D building 
typically consists of parallel lines in three mutually orthogonal directions. In this 
special case, the three vanishing points span an orthogonal coordinate system with 
the camera principal point at the center. 

One standard operation in projective geometry is homography. It maps points 
from one plane onto another. One example application is shown in Fig. 9.3. On the 
left, a perspectively distorted package of printer paper is shown. The paper is in A4 
format, and has hence a known ratio between height and width. After annotation 
of the corner points (red squares in the left picture), the package can be rectified to 
obtain a virtual frontal view (right picture). In forensic applications, we usually use 
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Fig. 9.3 Example homography. Left: rectification of an object with known aspect ratio (the A4 
paper) using four annotated corners (red dots). Right: rectified object after calculation of the homog- 
raphy from the distorted paper plane onto the world coordinate plane 


a mapping from a plane in the 3-D scene onto the camera image plane or vice versa 
as presented in Sect.9.3.1. For this reason, we review the homography calculation 
in the following paragraphs in greater detail. 

All points on a plane lie on a 2-D manifold. Hence, the mapping of a 2-D plane 
onto another 2-D plane is actually a transform of 2-D points onto 2-D points. We 
denote by X a point on a plane in homogeneous 2-D world coordinates, and a point 
in homogeneous 2-D image coordinates as x. Then, the homography is performed 
by multiplication with a 3 x 3 projection matrix H, 


x = Hf. (9.4) 


Here, the equality sign holds only up to the scale of the projection matrix. This scale 
ambiguity does not affect the equation, but it implies that the homography matrix 
H has only 8 degrees of freedom instead of 9. Hence, to estimate the homography 
matrix, a total of 8 constraints are required. These constraints are obtained from 
point correspondences. A point correspondence is a pair of two matching points, 
where one point lies on one plane and the second point on the other plane, and both 
points mark the same location on the object. The x-coordinates and the y-coordinates 
of a correspondence, each contribute one constraint, such that a total of four point 
correspondences suffices to estimate the homography matrix. 

The actual estimation is performed by solving Eq.9.4 for the elements h = 
(hii, ..., h33)! of H. This estimation is known as Direct Linear Transformation, and 
can be found in computer vision textbooks, e.g., by Hartley and Zisserman (2013, 
Chap. 4.1). The solution has the form 


Ah=0, (9.5) 


where A is a 2 - N x 9 matrix. Each of the N available point correspondences con- 
tributes two rows to A of the form 
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E: p 3⁄2 k— i. 


Fig.9.4 Distortions and aberrations from the camera optics. From left to right: image of arectangu- 
lar grid, and its image under barrel distortion and pincushion distortion, axial chromatic aberration 
and lateral chromatic aberration 


ax = (—X1, —J1, —1,0,0,0, x2.%1, x282, x2) (9.6) 


and 
axx+ı = (0, 0, 0, —%1, — 91, —1, yoX1, yoX2, y2) ` (9.7) 


Equation 9.5 can be solved via singular value decomposition (SVD). After the fac- 
torization into A = U £ VT, the unit singular vector that corresponds to the smallest 
singular value is the solution for h. 

The homography matrix implicitly contains the intrinsic and extrinsic camera 
parameters. More specifically, the RQ-decomposition factorizes a matrix into a prod- 
uct of an upper triangular matrix and an orthogonal matrix (Hartley and Zisserman 
2013, Appendix 4.1). Applied to the 3 x 3 matrix H, the 3 x 3 upper triangular 
matrix corresponds then to the intrinsic camera parameter matrix K. The orthogonal 
matrix corresponds to a 3 x 3 matrix [rır2t], where the missing third rotation vector 
of the extrinsic parameters is calculated as the cross-product r3 = ri X ro. 

The equations above implicitly assume a perfect mapping from points in the scene 
to pixels in the image. However, in reality, the camera lens is not perfect, which can 
slightly alter the pixel representation. We briefly state notable deviations from a per- 
fect mapping without going further into detail. Figure 9.4 illustrates these deviations. 
The three grids on the left illustrate geometric lens distortions, which change the map- 
ping of lines from the scene onto the image. Probably, the most well-known types are 
barrel distortion and pincushion distortion. Both types of distortions either stretch or 
shrink the projection radially around a center point. Straight lines appear then either 
stretched away from the center (barrel distortion) or contracted toward the center 
(pincushion distortion). While lens distortions affect the mapping of macroscopic 
scene elements, another form of lens imperfection is chromatic aberration, which is 
illustrated on the right of Fig. 9.4. Here, the mapping of a world point between the 
lens and image plane is shown. However, different wavelengths (and hence colors) 
are not mapped onto the same location on the sensor. The first illustration shows 
axial chromatic aberration, where individual color channels focus at different dis- 
tances. In the picture, only the green color channel converges at the image plane and 
is in focus; the other two color channels are slightly defocused. The second illus- 
tration shows lateral chromatic aberration, which displaces different wavelengths. 
In extreme cases, this effect can also be seen when zooming into a high-resolution 
photograph as a slight color seam at object boundaries that are far from the image 
center. 


216 C. Riess 


9.2.2 Photometry and Reflectance 


The use of photometric and reflectance models in image analysis is also a classical 
topic from the field of computer vision. These models operate on brightness or 
color distributions of pixels. If a pixel records the reflectance from the surface of a 
solid object, its brightness and color are a function of the object material, and the 
relative orientation of the surface patch, the light sources, and the camera in the 
scene. In computer vision, classical applications are, for example, intrinsic image 
decomposition for factorizing an object into geometry and material color (called 
albedo), photometric stereo and shape from shading for reconstructing the geometry 
of an object from the brightness distribution of its pixels, and color constancy for 
obtaining a canonical color representation that is independent of the color of the 
illumination. In multimedia forensics, the underlying assumption is that the light 
source and the camera induce global consistency constraints that are violated when 
an object is inserted or otherwise edited. 

The color and brightness of a single pixel I(y, x) are oftentimes modeled as 
irradiance of light that is reflected from a surface patch onto the camera lens. To this 
end, the surface patch itself must be illuminated by one or more light sources. Each 
light source is assumed to emit photons with a particular energy distribution, which 
is the spectrum of the light source. In human vision, the visible spectrum of a light 
source is perceived as the color of the light. The surface patch may reflect this light 
as diffuse or specular reflectance, or a combination of both. 

Diffuse reflectance, also called Lambertian reflectance, is by far the most com- 
monly assumed model. The light-object interaction is illustrated in the left part of 
Fig.9.5. Here, photons entering the object surface are scattered within the object, 
and then emitted in a random direction. Photons are much more likely to enter an 
object when they perpendicularly hit the surface than when they hit at a flatter angle. 
This is mathematically expressed as a cosine between the angle of incidence and the 
surface normal. The scattering within the object changes the spectrum toward the 
albedo of the object, such that the spectrum after leaving the object corresponds to 
the object color given the color of the light source. Since the photons are emitted in 
random directions, the perceived brightness is identical from all viewing directions. 

Specular reflectance occurs when photons do not enter the object, but instead are 
reflected at the surface. The light-object interaction is illustrated in the right part 
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Fig.9.5 Light-object interaction for diffuse (Lambertian) and specular reflectance 
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of Fig.9.5. Specular reflectance has two additional properties. First, the spectrum 
of the light is barely affected by this minimal interaction with the surface. Second, 
specular reflectance is not scattered in all directions, but only in a very narrow angle 
of exitance that is opposite to the angle of incidence. This can be best observed on 
mirrors, which exhibit almost perfect specular reflectance. 

The photometric formation of a pixel can be described by a physical model. Let 
i = I(y, x) be the usual RGB color representation at pixel I (y, x). We assume that 
this pixel shows an opaque surface object. Then, the irradiance of that pixel into the 
camera is quite generally described as 


i(0., u) = foam J [re Oe, v, à) - e(0i, à): ce (A) d0 dà . (9.8) 


0 À 


Here, 0; and 0. denote the 3-D angles under which a ray of light falls onto the surface 
and exits the surface, À denotes the wavelength of light, v is the normal vector of 
the surface point, e(0;, À) denotes the amount of light coming from angle 0; with 
wavelength i, r(0;, Oe, v, À) is the reflectance function indicating the fraction of 
light that is reflected in direction 0. after it arrived with wavelength à from angle 
0i, and c,, (À) models the camera’s color sensitivity in the three RGB channels to 
wavelength À. The inner expression integrates over all angles of incidence and all 
wavelengths to model all light that potentially leaves this surface patch in the direction 
of the camera. The function fam captures all further processing in the camera. For 
example, consumer cameras always apply a non-linear scaling called gamma factor 
to each individual color channel to make the colors appear more vivid. Theoretically, 
Jeam could also include additional processing that involves multiple pixels, such as 
demosaicking, white balancing, and lens distortion correction, although this is of 
minor concern for the algorithms in this chapter. 

The individual terms of Eq. 9.8 are illustrated in Fig. 9.6. Rays of light are emitted 
by the light source and arrive at a surface with term e(0;, A). Reflections from the 
surface with function r(0;, Oe, v, A) are filtered for their colors on the camera sensor 
with c;gp, and further processed in the camera with feam- 

For algorithm development, Eq.9.8 is in most cases too detailed to be useful. 
Hence, most components are typically neutralized by additional assumptions. For 
example, algorithms that focus on geometric arguments, such as the analysis of 
lighting environments, typically assume to operate on a grayscale image (or just a 
single-color channel), and hence remove all influences from the wavelength À and 
the color sensitivity cg) from the model. Conversely, algorithms that focus on color 
typically ignore the integral over 0; and all other influences of 6; and 0., and also 
assume a greatly simplified color sensitivity function Crgp. Both types of algorithms 
oftentimes assume that the camera response function fcam is linear. In this case, it 
is particularly important to invert the gamma factor as a pre-processing step when 
analyzing pictures from consumer cameras. The exact inversion formula depends on 
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Fig.9.6 Photometric image formation. Light sources emit light rays, which arrive as e(0;, À) with 
wavelength À at an angle of incidence 0; on a surface. The surface reflectance r (0i, 0e, v, A) models 
the reflection. Several of these rays may mix and eventually reach the camera sensor with color 
filters cygp. Further in-camera processing is denoted by fcam 


the color space. For the widely used sRGB color space, the inversion formula for a 
single RGB-color channel isrgg of a pixel is 


Ziege if irag < 0.04045 


(22ireB tI) $ otherwise = 
211 


i linearRGB — 


where we assume that the intensities are in a range from 0 to 1. 

The reflectance r (0i, 0., À) is oftentimes assumed to be purely diffuse, i.e., without 
any specular highlights. This reflectance model is called Lambertian reflectance. In 
this simple model, the angle of exitance ĝe is ignored. The amount of light that is 
reflected from an object only depends on the angle of incidence 6; and the surface 
normal v. More specifically, the amount of reflected light is the cosine between the 
angle of incidence and the surface normal v of the object at that point. The full 
reflectance function also encodes the color of the object, written here as sa (À), which 
yields a product of the cosine due to the ray geometry and the color, 


TLambertian (Ôi, v, A) = cos(0;, v)sa (À) . (9.10) 


In many use cases of this equation, it is common to pull s;(À) out of the equation 
(or to set it to unity, thereby ignoring the impact of color) and to only consider the 
geometric term cos(6;, v). 

The dichromatic reflectance model is a linear combination of purely diffuse and 
specular reflectance, i.e., 


Vdichromatic (Îi, be, V, à) = cos(0;, v)sa Q) + Ws (0;, v, 0.)s; (A) ` (9.1 1) 
Here, the purely diffuse term is identical to the Lambertian reflectance Eq. 9.10. The 


specular term again decomposes into a geometric part ws and a color part ss. Both 
are to my knowledge not explicitly used in forensics, and can hence be superficially 
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treated. However, the geometric part w, essentially contains the mirror equation, i.e., 
the angle of incidence 6; and the angle of exitance 0e have to be mirrored around 
the surface normal. The object color s, is typically set to unity with an additional 
assumption, which is called the neutral interface reflectance function. This has the 
effect that the color of specular highlights is equal to the color of the light source when 
inserting the dichromatic reflectance function rgichromatic into the full photometric 
model in Eq. 9.8. 


93 Algorithms for Physics-Based Forensic Analysis 


This section reviews several forensic approaches that exploit the introduced physical 
models. We introduce in Sect.9.3.1 geometric methods that directly operate on the 
physical models from Sect. 9.2.1. We introduce in Sect. 9.3.2 photometric methods 
that exploit the physical models from Sect. 9.2.2. Then, we present in Sect. 9.3.3 meth- 
ods that slightly relax assumptions from both geometric and photometric approaches. 


9.3.1 Principal Points and Homographies 


We start with a method that extracts relatively strong constraints from the scene. 
Consider a picture that contains written text, for example, signs with political mes- 
sages during a rally. If a picture of that sign is not taken exactly frontally, but at an 
angle, then the letters are distorted according to the laws of perspective projection. 
Conotter et al. proposed a direct application of homographies to validate a proper 
perspective mapping of the text (Conotter et al. 2010): An attacker who aims to 
manipulate the text on the sign must also distort the text. However, if the attacker fits 
the text only in a way that is visually plausible, the laws of perspective projection 
are still likely to be violated. 

To calculate the homography, it is necessary to obtain a minimum of four point 
correspondences between the image coordinates and a reference picture in world 
coordinates. However, if only a single picture is available, there is no outside ref- 
erence for the point correspondences. Alternatively, if the text is printed with a 
commonly used font, the reference can be synthetically created by rendering the 
text in that font. From this reference and the text in the image are SIFT keypoints 
extracted to obtain point correspondences for homography calculation. The matrix 
A from Eq.9.5 is composed with at least four corresponding keypoints points and 
solved. To avoid degenerate solutions, the corresponding points must be selected 
such that there are always at least two corresponding points that do not lie on a line 
with other corresponding points. The homography matrix H can then be used to 
perform the inverse homography, i.e., from image coordinates to world coordinates, 
and to calculate the root mean square error (RMSE) between the reference and the 
transformed image. 
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Fig. 9.7 Example scene for height measurements (“berlin square” by zoetnet; full picture credits 
at the end of the chapter). Lines that converge to the three vanishing points are shown in red, green, 
and blue. With an object or person with known height, the heights of other objects or persons can 
be calculated 


This method benefits from its relatively strong scene constraints. If it is plausible 
to assume that the written text is indeed on a planar surface, and if a sufficiently 
similar font is available (or even the original font), then it suffices to calculate the 
reprojection error of a homography onto the reference. A similar idea can be used if 
two images show the same scene, and one of the images has been edited (Zhang et al. 
2009a). In this case, the second image serves as a reference for the mapping of the 
first image. However, many scenes do not provide an exact reference like the exact 
shape of the written text or a second picture. Nevertheless, if there is other knowledge 
about the scene, slightly more complex variations of this idea can be used. 

For example, Yao et al. show that the vanishing line of a ground plane perpen- 
dicular to the optical axis can be used to calculate the height ratio of two persons 
or objects at an identical distance to the camera (Yao et al. 2012). This height ratio 
may be sufficient when additional prior knowledge is available like the actual body 
height of the persons. Iuliani et al. generalize this approach to ground planes with 
non-zero tilt angles (Iuliani et al. 2015). This is illustrated in Fig.9.7. The left side 
shows in red, green, and blue reference lines from the rectangular pattern on the 
floor and vertical building structures to estimate the vanishing points. The right side 
shows that the height of the persons on the reference plane can then be geometrically 
related. Note two potential pitfalls in this scene: first, when the height of persons is 
related, it may still be challenging to compensate for different body poses (Thakkar 
and Farid 2021). Second, the roadblock on the left could in principle provide the 
required reference height. However, it is not exactly on the same reference plane, as 
the shown public square is not entirely even. Hence, measurements with that block 
may be wrong. 

Another example is to assume some prior information about the geometry of 
objects in the scene. Then, a simplified form of camera calibration can be performed 
on each object. In a second step, all objects can be checked for their agreement on 
the calibration parameters. If two objects disagree, it is assumed that one of these 
objects has been inserted from another picture with different calibration parameters. 
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For example, Johnson and Farid proposed to use the eyes of persons for camera 
calibration (Johnson and Farid 2007a). The underlying assumption is that a specific 
part, namely the eyes’ limbi lie on a plane in 3-D space. The limbus is the boundary 
between the iris and the white of the eye. The limbi are assumed to be perfect circles 
when facing the camera directly. When viewed at an angle, the appearance of the 
limbi is elliptical. Johnson and Farid estimate the homography from these circles 
instead of isolated points. Since the homography also contains information about 
the camera intrinsic parameters, the authors calculate the principal point under the 
additional assumptions that the pixel skew is zero, and that the focal length is known 
from metadata or contextual knowledge of the analyst. In authentic photographs, the 
principal point can be assumed to be near the center of the image. If the principal point 
largely deviates from the image center, it can be plausibly concluded that the image 
is cropped (Xianzhe et al. 2013; Fanfani et al. 2020). Moreover, if there are multiple 
persons in the scene with largely different principal points, it can be concluded that 
the image is spliced from two sources. In this case, it is likely that the relative 
position of one person was changed from the source image to the target image, e.g., 
by copying that person from the left side of the image to the right side. Another useful 
type of additional knowledge can be structures with orthogonal lines like man-made 
buildings. These orthogonal lines can be used to estimate their associated vanishing 
points, which also provides the camera principal point (Iuliani et al. 2017). 

An inventive variation of these ideas has been used by Conotter et al. (2012). 
They investigate ballistic motions in videos, as it occurs, e.g., for a video of a thrown 
basketball. Here, the required geometric constraint does not come from an object per 
se, but instead from the motion pattern of an object. The assumption of a ballistic 
motion pattern includes a linear parabolic motion without external forces except for 
initial acceleration and gravity. This also excludes drift from wind. Additionally, 
the object is assumed to be rigid and compact with a well-defined center of mass. 
Under these assumptions, the authors show that the consistency of the motion can be 
determined by inserting the analytic motion equation into the perspective projection 
Eq. 9.1. This model holds not only for a still camera, but also for a moving camera if 
the camera motion can be derived from additional static objects in the surrounding. 
The validation of true physical motion is additionally supported by the cue that the 
projection of the object size becomes smaller when the object moves away from the 
camera and vice versa. 

Kee and Farid (2009) and Peng et al. (2017a) use the additional assumption that 
3-D models are known for the heads of persons in a scene. This is a relatively strong 
assumption, but such a 3-D model could, for example, be calculated for people of 
public interest from which multiple photographs from different perspectives exist, 
or a 3-D head model could be captured as part of a court case. With this assumption, 
Kee and Farid show that approximately co-planar landmarks from such a head model 
can also be used to estimate a homography when the focal length can be retrieved 
from the metadata (Kee and Farid 2009). Peng et al. use the full 3-D set of facial 
landmarks and additional face contours to jointly estimate the principal point and 
the focal length (Peng et al. 2017a). Their main contribution is to show that spliced 
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images of faces can be exposed when the original faces have been captured with 
different focal length settings. 


9.3.2 Photometric Methods 


Pixel colors and brightness offer several cues to physics-based forensic analysis. In 
contrast to the approaches of the previous section, the photometric methods do not 
model perspective projections, but instead argue about local intensity distributions of 
objects. We distinguish two main directions of investigation, namely the distribution 
of the amount of light that illuminates an object from various directions and the color 
of light. 


9.3.2.1 Lighting Environments 


Johnson and Farid proposed the first method to calculate the distribution of incident 
light on an object or person (Johnson and Farid 2005, 2007b). We discuss this 
method in greater detail due to its importance in the field of physics-based methods. 
The method operates on a simplified form of the generalized irradiance model from 
Eq.9.8. One simplifying assumption is that the camera function is linear, or that 
a non-linear camera response has been inverted in a pre-processing step. Another 
simplifying assumption is that the algorithm is only used on objects of a single color, 
hence all dependencies of Eq.9.8 on wavelength A can be ignored. Additionally, 
Lambertian reflectance is assumed, which removes the dependency on ĝe. These 
assumptions lead to the irradiance model 


i(v) = f cos(0;, v) - e(0;) d0, (9.12) 


6; 


for a single image pixel showing object surface normal v. Here, we directly inserted 
Eq.9.10 for the Lambertian reflectance model without the wavelength A, i.e., the 
color term in sa (À) in Eq. 9.10 is set to unity. 

Johnson and Farid present two coupled ideas to estimate the distribution of incident 
light: First, the integral over the angles of incident rays can be summarized by only 
9 parameters in the orthonormal spherical harmonics basis. Second, these nine basis 
coefficients can be easily regressed when a minimum of 9 surface points are available 
where both the intensity i and their associated surface normal vectors v are known. 

Spherical harmonics are a frequency representation of intensities on a sphere. In 
our case, they are used to model the half dome of directions from which light might 
fall onto a surface point of an opaque object. We denote the spherical harmonics 
basis functions as h; ;(x, y, z) where j < 2i — 1 and the parameters x, y, and z 
are points on a unit sphere, i.e., ||(x, y, z)! |” = 1. Analogous to other frequency 
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transforms like the Fourier transform or the DCT, the zeroth basis function hoo 
is the DC component that contains the overall offset of the values. Higher orders, 
i.e., where i > 0, contain increasingly higher frequencies. However, it suffices to 
consider only the basis functions up to second order (i.e., i < 2), to model all possible 
intensity distributions that can be observed under Lambertian reflectance. These 
orders i = {0, 1, 2} consist of a total of 9 basis functions. For these 9 basis functions, 
the coefficients are estimated in the proposed algorithm. 

The solution for the lighting environment requires knowledge about the surface 
normals from image locations where the brightness is measured. Obtaining such 
surface normals from only a 2-D image can be difficult. Johnson and Farid propose 
two options to address this challenge. First, if the 3-D structure of an object is known, 
a 3-D model can be created and fitted to the 2-D image. Then, the normals from the 
fitted 3-D model can be used. This approach has been demonstrated by Kee and Farid 
for faces, for which reliable 3-D model fitting methods are available (Kee and Farid 
2010). Second, if no explicit knowledge about the 3-D structure of an object exists, 
then it is possible to estimate surface normals from occluding contours of an object. 
Atan occluding contour, the surface normal is approximately coplanar with the image 
plane. Hence, the z-componentis zero, and the x- and y-components can be estimated 
as lines that are orthogonal to the curvature of the contour. However, without the 
z-component, also the estimated lighting environment can only be estimated as a 
projection onto the 2-D image plane. On the other hand, the spherical harmonics 
model also becomes simpler. By setting all coefficients that contain z-components 
to zero, only 5 unknown coefficients remain (Johnson and Farid 2005, 2007b). 

The required intensities can be directly read from the pixel grid. Johnson and 
Farid use the green color channel, since this color channel is usually most densely 
sampled by the Bayer pattern, and it has a high sensitivity to brightness differences. 
When 2-D normals from occluding contours are used, the intensities in the actual 
pixel location at the edge of an object might be inaccurate due to sampling errors 
and in-camera processing. Hence, in this case the intensity is extrapolated from the 
nearest pixels within the object along the line of the normal vector. 

The normals and intensities from several object locations are the known factors 
in a linear system of equations 

Al=i, (9.13) 


where i € R*! are the observed intensities at N pixel locations on an object. For 
these N locations, matching surface normals must be available and all constraints 
must be satisfied. In particular, the N locations must be selected from the same surface 
material, and they must be directly illuminated. A is the matrix of the spherical 
harmonics basis functions. In the 3-D case, its shape is RN>9 and in the 2-D case 
it is RY*°. Each of the N rows in this matrix evaluates the basis functions for the 
surface normal of the associated pixel. The vectors l are the unknown coefficients of 
the basis functions, with dimension R°*! in the 3-D case and R°*! in the 2-D case. 
An additional Tikhonov regularizer dampens higher frequency spherical harmonics 
coefficients, which yields the objective function 
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EQ) = ||A1—ill2 + RI , (9.14) 


with the Tikhonov regularizer R = diag(1 2 2 2 3 3 3 3 3) for the 3-D case, and 
R = diag(1 2 2 3 3) for the 2-D case. A minimum is found after differentiation 
with respect to l and solving for l, i.e., 


l= (ATA + uR'R) UAT. (9.15) 


The lighting environments of two objects can be directly compared via correlation. 
A simple approach is to render two spheres from the coefficients of each object, and 
to correlate the rendered intensities in the 3-D case, or just the intensities along the 
boundary of the sphere in the 2-D case. Alternatively, the coefficients can also be 
directly compared after a suitable transform (Johnson and Farid 2005, 2007b). 

This first work on the analysis of lighting environments inspired many follow-up 
works to relax the relatively restrictive assumptions of the method. Fan et al. (2012) 
explored shape-from-shading as an intermediate step to estimate 3-D surface normals 
when an explicit 3-D object model is not available. Riess et al. added a reflectance 
term to the 2-D variant of the algorithm, such that also contours from different mate- 
rials can be used (Riess et al. 2017). Carvalho et al. investigate human annotations 
of 3-D surface normals on objects other than faces (Carvalho et al. 2015). Peng et 
al. automate the 3-D variant by Kee and Farid (2010) via automated landmark detec- 
tion (Peng et al. 2016). Seuffert et al. show that high-quality 3-D lighting estimation 
requires a good-fitting geometric model (Seuffert et al. 2018). Peng et al. further 
include a texture term for 3-D lighting estimation on faces (Peng et al. 2015, 2017b). 

Some methods also provide alternatives to the presented models. The assumption 
of an orthographic projection has also been relaxed, and will be separately discussed 
in Sect.9.3.3. Matern et al. integrate over the 2-D image gradients, which provides 
a slightly less accurate, but very robust 2-D lighting estimation (Matern et al. 2020). 
Huang et al. propose to calculate 3-D lighting environments from general objects 
using surface normals from shape-from shading algorithms (Huang and Smith 2011). 
However, this classic computer vision approach requires relatively simple objects 
such as umbrellas to be robustly applicable. Zhou et al. train a neural network to 
learn the lighting estimation from faces in a fully data-driven manner (Zhou et al. 
2018). 

An example 2-D lighting estimation is shown on top of Fig.9.8. The person is 
wearing a T-shirt, which makes it difficult to find surface normals that are of the 
same material and at the same time point in a representative number of directions. 
Contours for the T-shirt and the skin are annotated in green and red, together with 
their calculated normals. The x-axes of the scatterplots contain the possible angles of 
the surface normals between —z and +z. The y-axes contain the image intensities. 
The left scatterplot naively combines the intensities of the black T-shirt (green) and 
the light skin (red). Naively fitting the spherical harmonics to the distribution of both 
materials leads to a failure case: the light skin dominates the estimation, such that 
the dominant light source (maximum of the red line) appears to be located below the 
person. On the other hand, using only the skin pixels or only the T-shirt pixels leads to 
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Fig. 9.8 2-D lighting estimation on objects with multiple materials. Top: Example 2-D lighting 
estimation according to Johnson and Farid (2005, 2007b). From left to right: contour annotations 
and estimated normals for two materials (green and red lines) on a person. The scatterplots show the 
angle of the normal along the x-axis with the associated pixel intensities along the y-axis. The left 
scatterplot mixes both materials, which leads to a wrong solution. The right scatterplot performs 
an additional reflectance normalization (Riess et al. 2017), such that both materials can be used 
simultaneously. Bottom: 2-D lighting environments can also be estimated from image gradients, 
which are overall less accurate, but considerably more robust to material transitions and wrong or 
incomplete segmentations 


a very narrow range of surface normals, which also makes the estimation unreliable. 
The right plot shows the same distribution, but with the reflectance normalization 
by Riess et al. (2017), which correctly estimates the location of the dominant light 
source above the person. The three spheres on the bottom illustrate the estimation 
of the light source from image gradients according to Matern et al. (2020). This 
method only assumes that objects are mostly convex, and that the majority of local 
neighborhoods for gradient computation consists of the same material. With these 
modest assumptions, the method can oftentimes still operate on objects with large 
albedo differences, and on object masks with major errors in the segmentation. 


9.3.2.2 Color of Illumination 


The spectral distribution of light determines the color formation in an image. For 
example, the spectrum of sunlight is impacted by the light path through the atmo- 
sphere, such that sunlight in the morning and in the evening is more reddish than at 
noon. As another example, camera flashlight oftentimes exhibits a strong blue compo- 
nent. Digital cameras typically normalize the colors in an image with a manufacturer- 
specific white-balancing function that oftentimes uses relatively simple heuristics. 
Post-processing software such as Adobe Lightroom provides more sophisticated 
functions for color adjustment. 

These many influencing factors on the colors of an image make their forensic 
analysis interesting. It is a reasonable assumption that spliced images exhibit differ- 
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ences in the color distribution if the spliced image components had different lighting 
conditions upon acquisition or if they had undergone different post-processing. 

Forensics analysis methods assume that the camera white-balancing is a global 
image transformation that does not introduce local color inconsistencies in an original 
image. Hence, local inconsistencies in the color formation are attributed to potential 
splicing manipulations. 

There are several possibilities to locally analyze the consistency of the lighting 
color. One key requirement is the need to estimate the illuminant color locally on 
individual objects to expose inconsistencies. Particularly well-suited for local illumi- 
nation color estimation is the dichromatic reflectance model from Eq. 9.11 in com- 
bination with the neutral interface reflectance function. This model includes diffuse 
and specular reflectance, with the additional assumption that the specular portion 
exhibits the color of the light source. 

To our knowledge, the earliest method that analyzes reflectance has been pre- 
sented by Gholap and Bora (2008). This work proposes to calculate the intersection 
of so-called dichromatic lines of multiple objects to expose splicing, using a classical 
result by Tominaga and Wandell (1989): on a monochromatic object, the distribu- 
tion of diffuse and specular pixels forms a 2-D plane in the 3-D RGB space. This 
plane becomes a line when projecting it onto the 2-D r-g chromaticity space, where 
r=ir/(ir tig tig) and g = ig/(ir + ic +ig) denote the red and green color 
channels, normalized via division by the sum of all color channels of a pixel. If the 
scene is illuminated by a single, global illuminant, the dichromatic lines of differ- 
ently colored objects all intersect at one point. Hence, Gholap and Bora propose to 
check for three or more scene objects whether their dichromatic lines intersect in a 
single point. This approach has the advantage that it is very simple. However, to our 
knowledge it barely provides possibilities to validate the model assumptions, i.e., 
ways to check whether the assumption of dichromatic reflectance and a single global 
illuminant holds for the objects under investigation. 

Riess and Angelopoulou (2010) propose the inverse-intensity chromaticity (IIC) 
space by Tan et al. (2004) to directly estimate the color of the illuminant. The IIC space 
is simple to calculate, but provides very convenient properties for forensic analysis. 
It also operates on surfaces of dichromatic reflectance, and requires only a few pixels 
of identical material that exhibit a mixture of specular and diffuse reflectance. For 
each color channel, the IIC space is calculated as a 2-D chart for related pixels. Each 
pixel is transformed to a tuple 


! | 
er -( es ) (9.16) 
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where again ix, ig, and ig denote the red, green, and blue color channels. On pixels 
with different portions of specular and diffuse reflectance, the distribution forms a 
triangle or a line that indicates the chromaticity of the light source at the y-axis 
intercept. This is illustrated in Fig.9.9. For persons, the nose region is oftentimes 
a well-suited location to observe partially specular pixels. Close-ups of the manual 
annotations are shown together with the distributions in IIC space. In this original 
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Fig.9.9 Example application of illuminant color estimation via IIC color charts. For each person 
in the scene, a few pixels are selected that are of the same material, and that exhibit a mixture of 
specular and diffuse reflectance. The associated IIC diagrams are shown on the right. The pixel 
distributions (blue) point toward the estimated color of the light source at the y-axis intercept (red 
lines) 


image, the pixel distributions for the red, green, and blue IIC diagrams point to almost 
identical illuminant chromaticities. In real scenes, the illuminant colors are oftentimes 
achromatic. However, the IIC diagrams are very sensitive to post-processing. Thus, 
if an image is spliced from sources with different post-processing, major differences 
in the IIC diagrams may be observed. 

A convenient property of IIC diagrams is their interpretability. Pixel distributions 
that do not match the assumptions do not form a straight line. For example, purely 
diffuse pixels tend to cluster in circular shapes, and pixels of different materials form 
several clusters, which indicates to an analyst the need to revise the segmentation. 

One disadvantage of the proposed method is the need for manual annotations, 
since the automated detection of specularities is a severely underconstrained prob- 
lem, and hence not reliable. To mitigate this issue, Riess and Angelopoulou propose 
to estimate the color of the illuminant on small, automatically segmented image 
patches of approximately uniform chromaticity. They introduce the notion of “illu- 
minant map” for a picture where each image patch is colored with the estimated 
illuminant chromaticity. An analyst can then consider only those regions that match 
the assumptions of the approach, i.e., that exhibit partially specular and partially dif- 
fuse reflectance on dichromatic materials. These regions can be considered reliable, 
and their estimated illuminant colors can be compared to expose spliced images. 
However, for practical use, we consider illuminant maps oftentimes inferior to a 
careful, fully manual analysis. 


228 C. Riess 


Another analytic approach based on the dichromatic reflectance model has been 
proposed by Francis et al. (2014). It is conceptually similar, but separates specular 
and diffuse pixels directly in RGB space. Several works use machine learning to 
automate the processing and to increase the robustness of illuminant descriptors. For 
example, Carvalho et al. use illuminant maps that are calculated from the IIC space, 
and also from the statistical gray edge illuminant estimator, to train a classifier for the 
authenticity assessment of human faces (de Carvalho et al. 2013). By constraining 
the application of the illuminant descriptors only to faces, the variety of possible 
surface materials is greatly restricted, which can increase the robustness of the esti- 
mator. However, this approach also performs further processing on the illuminant 
maps, which slightly leaves the ground of purely physics-based methods and enters 
the domain of statistical feature engineering. Hadwiger et al. investigate a machine 
learning approach to learn the color formation of digital cameras (Hadwiger et al. 
2019). They use images with Macbeth color charts to learn the relationship between 
ground truth colors and their representations of different color camera pipelines. In 
different lines of work on image colors, Guo et al. investigate fake colorized image 
detection via hue and saturation statistics (Guo et al. 2018). Liu et al. investigate 
methods to assess the ambient illumination in shadow regions for their consistency 
by estimating the shadow matte (Liu et al. 2011). 


9.3.3 Point Light Sources and Line Constraints in the 
Projective Space 


Outdoor scenes that are only illuminated by the sun provide a special set of con- 
straints. The sun can be approximated as a point light source, i.e., a source where all 
light is emitted from a single point in 3-D space. Also, indoor scenes may in special 
cases contain light sources that can be approximated as a single point, like a single 
candle that illuminates the scene (Stork and Johnson 2006). 

Such a single-point light source makes the modeling of shadows particularly 
simple: the line that connects the tip of an object with the tip of its cast shadow also 
intersects the light source. Multiple such lines can be used to constrain the position of 
the light source. Conversely, shadows that have been artificially inserted may violate 
these laws of projection. 

Several works make use of this relationship. Zhang et al. (2009b) and Wu et 
al. (2012) show that cast shadows of an object can be measured and validated against 
the length relationships of the persons or objects that cast the shadow. One assumption 
of these works is that the shadow is cast on a ground plane, such that the measurements 
are well comparable. Conversely, Stork and Johnson use cast shadows to verify that 
a scene is illuminated by a specific light source, by connecting occluding contours 
that likely stem from that light source (Stork and Johnson 2006). 
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However, these approaches require a clear correspondence between the location 
on the object that casts the shadow and the location where the shadow is cast to. This 
is not an issue for poles and other thin cylindrical objects, as the tip of the object can 
be easily identified. However, it is oftentimes difficult to find such correspondences 
on objects of more complex shapes. 

O’Brien and Farid hence propose a geometric approach with much more relaxed 
requirements to mark specific locations for object-shadow correspondences (O’Brien 
and Farid 2012). The method is applicable not only to cast shadows, but also to a 
mixture of cast shadows and attached shadows as they occur on smooth surfaces. 
The idea is that if the 2-D image plane would be infinitely large, there would be a 
location onto which the light source is projected, even if it is outside of the actual 
image. Object-shadow correspondences can constrain the projected location of the 
light source. The more accurate an object-shadow correspondence can be identified, 
the stronger is the constraint. However, it is also possible to incorporate very weak 
constraints like attached shadows. If the image has been edited, and, for example, an 
object has been inserted with incorrect illumination, then the constraints from that 
inserted object are likely to violate the remaining constraints in the image. 

The constraints are formed from half-planes in the 2-D image plane. An attached 
shadow separates the 3-D space into two parts, namely one side that is illuminated and 
one side that is in shadows. The projection of the illuminated side onto the 2-D image 
plane corresponds to one half-plane with a boundary normal that corresponds to the 
attached shadow edge. An object-shadow correspondence with some uncertainty 
about the exact location at the object and the shadow separates the 3-D space into a 
3-D wedge of possible light source locations. Mapping this wedge onto the 2-D image 
plane leads to a 2-D wedge. Such a wedge can be constructed from two opposing 
half-planes. 

An application of this approach is illustrated in Fig.9.10. Here, the pen and the 
spoon exhibit two different difficulties for analyzing the shadows, which is shown 
in the close-ups on the top right: the shadow of the pen is very unsharp, such that it 
is difficult to locate its tip. Conversely, the shadow of the spoon is very sharp, but 
the round shape of the spoon makes it difficult to determine from which location 
on the spoon surface the shadow tip originates. However, both uncertainties can be 
modeled as half-plane constraints shown at the bottom. These constraints have a 
common intersection in the image plane outside of the image, shown in pink. Hence, 
the shadows of both objects are consistent. Further objects could now be included in 
the analysis, and their half-plane constraints have to analogously intersect the pink 
area. 

In follow-up work, this approach has been extended by Kee et al. to also include 
constraints from object shading (Kee et al. 2013). This approach is conceptually 
highly similar to the estimation of lighting environments as discussed in Sect. 9.3.2.1. 
However, instead of assuming an orthographic projection, the lighting environments 
are formulated here within the framework of perspective projection to smoothly 
integrate with the shadow constraints. 
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Fig.9.10 Example scene for forensics from cast shadows. The close-ups show the uncertain areas 
of shadow formation. Left: the shadow of the tip of the pen is very unsharp. Right: the round shape 
of the spoon makes it difficult to exactly localize the location that casts the tip of the shadow. 
Perspective constraints according to O’Brien and Farid allow to nevertheless use these uncertainty 
regions for forensic analysis (O’Brien and Farid 2012): the areas of the line constraints overlap 
in the pink region of the image plane (although outside of the actual image), indicating that the 
shadows of both objects are consistent 


9.4 Discussion and Outlook 


Physics-based methods for forensic analysis are based on simplified physical models 
to validate the authenticity of a scene. Journalistic verification uses physics-based 
methods mostly to answer questions about the time and place of an acquisition, 
the academic literature mostly focuses on the detection of inconsistencies within an 
image or video. 

Conceptually, physics-based methods are quite different from statistical 
approaches. Each physics-based approach requires certain scene elements to per- 
form an analysis, while statistical methods can operate on almost arbitrary scenes. 
Also, most physics-based methods require the manual interaction of an analyst to 
provide “world knowledge”, for example, to annotate occluding contours, or to select 
partially specular pixels. On the other hand, physics-based methods are mostly inde- 
pendent of the image or video quality, which makes them particularly attractive for 
analyzing low-quality content or even analog content. Also, physics-based methods 
are inherently explainable by verification of their underlying models. This would in 
principle also make it well possible to perform a rigorous analysis of the impact of 
estimation errors from various error sources, which is much more difficult for statisti- 
cal approaches. Surprisingly, such robustness investigations have until now only been 
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performed to a limited extent, e.g., for the estimation of vanishing points (Iuliani et al. 
2017), perspective distortions (Peng et al. 2017a), or height measurements (Thakkar 
and Farid 2021). 

The future of physics-based methods is challenged by the technical progress in two 
directions: First, the advent of computational images in modern smartphones. Sec- 
ond, physically plausible computer-generated scene elements from modern methods 
in computer graphics and computer vision. 

Computational images compensate for the limited camera optics in smartphones, 
which are due to space constraints not competitive with high-quality cameras. Hence, 
when a modern smartphone captures “an image”, it actually captures a short video, 
and calculates a high-quality single image from that frame sequence. However, if the 
image itself is not the result of a physical image formation process, the validity of 
physics-based models for forensic analysis is inherently questioned. It is currently an 
open question to which extent these computations affect the presented physics-based 
algorithms. 

Recent computer-generated scene elements are created with an increasing con- 
tribution of learned physics-based models for realistically looking virtual reality or 
augmented reality (VR/AR) applications. These approaches are a direct competition 
to physics-based forensic algorithms, as they use very similar models to minimize 
their representation error. This emphasizes the need for multiple complementary 
forensic tools to expose manipulations from cues that are not relevant for the specific 
tasks of such VR/AR applications, and are hence not considered in their optimization. 


9.5 Picture Credits 


Figure 9.1 is published by Dennis Jarvis (archer 10, https://www.flickr.com/photos/ 
archer10/ with an Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0) License. 
Full text to the license is available at https://creativecommons.org/licenses/by- 
sa/2.0/; link to the original picture is https://www.flickr.com/photos/archer10/ 
2216460729/. The original image is downsampled for reproduction. 

Figure9.7 is published by zoetnet, https://flickr.com/photos/zoetnet/ with an 
Attributed 2.0 Generic (CC BY 2.0) License. Full text to the license is available at 
https://creativecommons.org/licenses/by/2.0/; link to the original picture is https:// 
flickr.com/photos/zoetnet/9527389096/. The original image is downsampled for 
reproduction, and annotations of perspective lines are added. 
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Chapter 10 A) 
Power Signature for Multimedia geile: 
Forensics 


Adi Hajj-Ahmad, Chau-Wai Wong, Jisoo Choi, and Min Wu 


There has been an increasing amount of work surrounding the Electric Network 
Frequency (ENF) signal, an environmental signature captured by audio and video 
recordings made in locations where there is electrical activity. ENF is the frequency 
of power distribution networks, 60 Hz in most of the Americas and 50Hz in most 
other parts of the world. The ubiquity of this power signature and the appearance of 
its traces in media recordings motivated its early application toward time—location 
authentication of audio recordings. Since then, more work has been done toward 
utilizing this signature for other forensic applications, such as inferring the grid 
in which a recording was made, as well as applications beyond forensics, such as 
temporally synchronizing media pieces. The goal of this chapter is to provide an 
overview of the research work that has been done on the ENF signal and to provide 
an outlook for the future. 
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10.1 Electric Network Frequency (ENF): An 
Environmental Signature for Multimedia Recordings 


In this chapter, we discuss the Electric Network Frequency (ENF) signal, an environ- 
mental signature that has been under increased study since 2005 and has been shown 
to be a useful tool for a number of information forensics and security applications. 
The ENF signal is such a versatile tool that later work has also shown that it can have 
applications beyond security, such as for digital archiving purposes and multimedia 
synchronization. 

The ENF signal is a signal influenced by the electric power grid. Most of the 
power provided by the power grid comes from turbines that work as generators of 
alternating current. The rotational velocity of these turbines determines the ENF, 
which usually has a nominal value 60 Hz in most of the Americas and 50 Hz in most 
other parts of the world. The ENF fluctuates around its nominal value as a result of 
power-frequency control systems that maintain the balance between the generation 
and consumption of electric energy across the grid (Bollen and Gu 2006). These 
fluctuations can be seen as random, unique in time, and typically very similar in all 
locations of the same power grid. The changing instantaneous value of the ENF over 
time is what we define as the ENF signal. 

What makes the ENF particularly relevant to multimedia forensics is that the 
ENF can be embedded in audio or video recordings made in areas where there is 
electrical activity. It is in this way that the ENF serves as an environmental signature 
that is intrinsically embedded in media recordings. Once we can extract this invisible 
signature well, we can answer many questions about the recording it was embedded 
in. For example: When was the recording made? Where was it made? Has it been 
tampered with? A recent study has shown that ENF traces can even be extracted from 
images captured by digital cameras with rolling shutters, which can allow researchers 
to determine the nominal frequency of the area in which an image was captured. 

Government agencies and research institutes of many countries have conducted 
ENF-related research and developmental work. This includes academia and govern- 
ment in Romania (Grigoras 2005, 2007, 2009), Poland (Kajstura et al. 2005), Den- 
mark (Brixen 2007a, b, 2008), the United Kingdom (Cooper 2008, 2009a, b, 2011), 
the United States (Richard and Peter 2008; Sanders 2008; Liu et al. 2011, 2012; 
Ojowu et al. 2012; Garg et al. 2012, 2013; Hajj-Ahmad et al. 2013), the Netherlands 
(Huijbregtse and Geradts 2009), Brazil (Nicolalde and Apolinario 2009; Rodriguez 
et al. 2010; Rodriguez 2013; Esquef et al. 2014, Egypt (Eissa et al. 2012; Elme- 
salawy and Eissa 2014), Israel (Bykhovsky and Cohen 2013), Germany (Fechner 
and Kirchner 2014), Singapore (Hua 2014), Korea (Kim et al. 2017; Jeon et al. 
2018), and Turkey (Vatansever et al. 2017, 2019). Among the work studied, we can 
see two major groups of study. The first group addresses challenges in accurately 
estimating an ENF signal from media signals. The second group focuses on possible 
applications of the ENF signal once it is extracted properly. In this chapter, we con- 
duct a comprehensive literature study on the work done so far and outline avenues 
for future work. 
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The rest of this chapter is organized as follows. Section 10.2 describes the meth- 
ods for ENF signal extraction. Section 10.3 discusses the findings of works study- 
ing the presence of ENF traces and providing statistical models for ENF behav- 
lor. Section 10.4 explains how ENF traces are embedded in videos and images and 
how they can be extracted. Section 10.5 delves into the main forensic and security 
ENF applications proposed in the literature. Section 10.6 describes an anti-forensics 
framework for understanding the interplay between an attacker and an ENF ana- 
lyst. Section 10.7 extends the conversation toward ENF applications beyond security. 
Section 10.8 summarizes the chapter and provides an outlook for the future. 


10.22 Technical Foundations of ENF-Based Forensics 


As will be discussed in this chapter, the ENF signal can have a number of useful 
real-world applications, both in multimedia forensics-related fields and beyond. A 
major first step to these applications, however, is to properly extract and estimate the 
ENF signal from an ENF-containing media signal. But, how can we check that what 
we have extracted is Indeed the ENF signal? For this purpose, we introduce the notion 
of power reference signals in Sect. 10.2.1. Afterward, we discuss various proposed 
methods to estimate the ENF signal from a media signal. We focus here on ENF 
extraction from digital one-dimensional (1-D) signals, typically audio signals, and 
delay the explanation of extracting ENF from video and image signals to Sect. 10.4. 


10.2.1 Reference Signal Acquisition 


Power reference recordings can be very useful to ENF analysis. The ENF traces 
found in the power recordings are typically much stronger than the ENF traces found 
in audio recordings. This is why they can be used as a reference and a guide for ENF 
signals extracted from audio recordings, especially in cases where one has access 
to a pair of simultaneously recorded signals, one a power signal and one an audio 
signal; the ENF in both should be very similar at the same instants of time. In this 
section, we describe different methods used in the ENF literature to acquire power 
reference recordings. 

The Power Information Technology Laboratory at the University of Tennessee, 
Knoxville (UTK), operates the North American Power Grid Frequency Monitoring 
Network System (FNET), or GridEye. The FNET/GridEye is a power grid situational 
awareness tool that collects real-time, Global Position System (GPS) timestamped 
measurements of grid reference data at the distribution level (Liu et al. 2012). A 
framework for FNET/GridEye is shown in Fig. 10.1. The FNET/GridEye system 
consists of two major components, which are the frequency disturbance recorders 
(FDRs) and the information management system (IMS). The FDRs are the sensors 
of the system; each FDR is an embedded microprocessor system that performs local 
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Fig. 10.1 Framework of the FNET/GridEye system (Zhang et al. 2010) 


GPS-synchronized measurements, such as computing the instantaneous ENF values 
over time. In this setup, the FDR estimates the power frequency values at a rate of 
10 records/s using phasor techniques (Phadke et al. 1983). The measured data is sent 
to the server through the Internet, where the IMS collects the data, stores it, and 
provides a platform for the visualization and analysis of power system phenomena. 
More information on the FNET/GridEye system can be found in FNET Server Web 
Display (2021), Zhong et al. (2005), Tsai et al. (2007), and Zhang et al. (2010). 

A system similar to the FNET/GridEye system, named the wide area management 
systems (WAMS), has been set up in Egypt, where the center providing the informa- 
tion management functions is at Helwan University (Eissa et al. 2012; Elmesalawy 
and Eissa 2014). The researchers operating this system have noted that during system 
disturbances, the instantaneous ENF value is not the same across all points of the 
grid. It follows that in such cases, the ENF value from a single point in the grid may 
not be a reliable reference. For this purpose, in Elmesalawy and Eissa (2014), they 
propose a method for establishing ENF references from a number of FDRs deployed 
in multiple locations of the grid, rather than from a single location. 

Recently, the authors of Kim et al. (2017) presented an ENF map that takes a differ- 
ent route: Instead of relying on installing specialized hardware, they built their ENF 
map by extracting ENF signals from the audio tracks of open-source online streaming 
multimedia data obtained from such sources as “Ustream”, “Earthcam”, and “Skyline 
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Fig. 10.2 Sample generic schematic of sensoring hardware (Top et al. 2012) 


webcams”. Most microphones used in such streaming services are mains-powered, 
which makes the ENF traces captured in the recordings stronger than those that would 
be captured in recordings made using battery-powered recorders. Kim et al. (2017) 
addressed in detail the challenges that come with the proposed approach, including 
accounting for packet loss, aligning different ENF signals temporally, and interpo- 
lating ENF signals geographically to account for locations that are not covered by 
the streaming services. 

Systems such as the ones discussed offer tremendous benefits in power frequency 
monitoring and coverage, yet one does not need access to them in order to acquire 
ENF references locally. An inexpensive hardware circuit can be built to record a 
power signal or measure ENF variations, given access to an electric wall outlet. Typ- 
ically, a transformer is used to convert the voltage from the wall outlet voltage levels 
down to a level that an analog-to-digital converter can capture. Figure 10.2 shows 
a sample generic circuit that can be built to record the power reference signal (Top 
et al. 2012). 

There is more than one design to build the sensor hardware. In the example of 
Fig. 10.2, an anti-aliasing filter is placed in the circuit along with a fuse for safety 
purposes. In some implementations, such as in Hajj-Ahmad et al. (2013), a step- 
down circuit is connected to a digital audio recorder that records the raw power 
signal, whereas in other implementations, such as in Fechner and Kirchner (2014), 
the step-down circuit is connected to a BeagleBone Black board, via a Schmitt trigger 
that computes an estimate of the ENF signal spontaneously. In the former case, the 
recorded digital signal is processed later using ENF estimation techniques, which 
will be discussed in the next section, to extract the reference ENF signal while in the 
latter case, the ENF signal is ready to be used as a reference for the analysis. 

As can be seen, there are several ways to acquire ENF references depending on the 
resources one has access to. An ENF signal extracted through these measurements 
typically has a high signal-to-noise ratio (SNR) and can be used as a reference in 
ENF research and applications. 
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10.22 ENF Signal Estimation 


In this section, we discuss several approaches that have been proposed in the literature 
to extract the ENF signal embedded in audio signals. 

A necessary stage before estimating the changing instantaneous ENF value over 
time is preprocessing the audio signal. Typically, since the ENF component is in a 
low-frequency band, a lowpass filter with proper anti-aliasing can be applied to the 
ENF-containing audio signal to make the computations of the estimation algorithms 
easier. For some estimation approaches, it also helps to bandpass the ENF-containing 
audio signal around the frequency band of interest, i.e., frequency band surrounding 
the nominal ENF value. Besides that, the ENF-containing signal is then divided into 
consecutive overlapping or nonoverlapping frames. The aim of the ENF extraction 
process would be to apply a frequency estimation approach on each frame to estimate 
its most dominant frequency around the nominal ENF value. This frequency estimate 
would be the estimated instantaneous ENF value for the frame. Concatenating the 
frequency estimates of all the frames together would form the extracted ENF signal. 
The length of the frame, typically on the order of seconds, determines the resolution 
of the extracted ENF signal. Typically, a trade-off exists here. Smaller frame size 
would better capture the ENF variations but may result in poorer performance of the 
frequency estimation approach, and vice versa. 

Generally speaking, an ENF estimation approach can be one of three types: (1) a 
time-domain approach, (2) a nonparametric frequency-domain approach, and (3) a 
parametric frequency-domain approach. 


10.2.2.1 Time-Domain Approach 


The time-domain zero-crossing approach is fairly straightforward, and it is one of 
the few ENF estimation approaches that is not preceded by dividing the record- 
ing into consecutive frames for individual processing. As described in Grigoras 
(2009), a bandpass filter with 49-51 Hz or 59-61Hz cutoff is first applied to the 
ENF-containing signal without downsampling initially. This is done to separate the 
ENF waveform from the rest of the recording. Afterward, the zero-crossings of the 
remaining ENF signal are computed, and the time differences between consecutive 
zero values are computed and used to obtain the instantaneous ENF estimates. 


10.2.2.2 Nonparametric Frequency-Domain Approach 


Nonparametric approaches do not assume any explicit model for the data. Most of 
these approaches are based on the Fourier analysis of the signal. 

Most nonparametric frequency-domain approaches are based on the periodogram- 
based or spectrogram-based approach utilizing the short-time Fourier transform 
(STFT). STFT is often used for signals with a time-varying spectrum, such as speech 
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signals. After the signal is segmented into overlapping frames, each frame undergoes 
Fourier analysis to determine the frequencies present. A spectrogram is then defined 
as the squared magnitude of the STFT and is usually displayed as a two-dimensional 
(2-D) intensity plot, with the two axes being time and frequency, respectively (Hajj- 
Ahmad et al. 2012). 

Because of the slowly varying nature of the ENF signal, it is reasonable to consider 
the instantaneous frequency within the duration of a frame approximately constant for 
analysis. Given a sinusoid of a fixed frequency embedded in noise, the power spectral 
density (PSD) estimated by the STFT should ideally exhibit a peak at the frequency 
of the sinusoidal signal. Estimating this frequency well gives a good estimate for the 
ENF value of this frame. 

A straightforward approach to estimating this frequency would be finding the 
frequency that has the maximum power spectral component. Directly choosing this 
frequency as the ENF value, however, typically leads to a loss in accuracy, because 
the spectrum is computed for discretized frequency values and the actual frequency of 
the maximum energy may not be aligned with these discretized frequency values. For 
this reason, typically, the STFT-based ENF estimation approach carries out further 
computations to obtain a more refined estimate. Examples of such operations are 
quadratic or spline interpolations being done about the detected spectral peak, or 
a weighted approach where the ENF estimate is found by weighing the frequency 
bins around the nominal value based on their spectrum intensities (Hajj-Ahmad et al. 
2012; Cooper 2008, 2009b; Grigoras 2009; Liu et al. 2012). 

In addition to the STFT-based approach that is most commonly used, the authors 
in Ojowu et al. (2012) advocate the use of a nonparametric, adaptive, and high- 
resolution technique known as the time-recursive iterative adaptive approach (TR- 
IAA). This algorithm reaches the spectral estimates of a given frame by minimizing 
a quadratic cost function using a weighted least squares formulation. It is an iterative 
technique that takes 10-15 iterations to converge, where the spectral estimate is 
initialized to be either the spectrogram or the final spectral estimate of the preceding 
frame (Glentis and Jakobsson 2011). As compared to STFT-based techniques, this 
approach is more computationally extensive. In Ojowu et al. (2012), the authors 
report that the STFT-based approach gives slightly better estimates of the network 
frequency when the SNR is high, yet the adaptive TR-IAA approach achieves a 
higher ENF estimation accuracy in the presence of interference from other signals. 

To enhance the ENF estimation accuracy, Ojowu et al. (2012) propose a frequency 
tracking method based on dynamic programming that finds a minimum cost path. 
For each frame, the method generates a set of candidate frequency peak locations 
from which the minimum cost path is generated. The cost function selected in this 
proposed extension takes into account the slowly varying nature of the ENF and 
penalizes significant jumps in frequency from frame to frame. The minimum path 
found is the estimated ENF signal. 

In low SNR scenarios, the sets of candidate peak frequency locations used by 
Ojowu et al. (2012) may not be precisely estimated. To avoid using imprecise peak 
locations, the Zhu et al. (2018, 2020) apply dynamic programming directly to a 2-D 
time-frequency representation such as the spectrogram. The authors also propose to 
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extract multiple ENF traces of unequal strengths by iteratively estimating and erasing 
the strongest one. A near-real-time variant of the multiple ENF extraction algorithm 
was proposed in Zhu et al. (2020) to facilitate efficient online frequency tracking. 

There have been other proposed modifications to nonparametric approaches in 
the literature to improve the final ENF signal estimate. In Ling Fu et al. (2013), the 
authors propose a discrete Fourier transform (DFT)-based binary search algorithm to 
lower the computational complexity. Instead of calculating the full Fourier spectrum, 
the proposed method uses the DFT to calculate a spectral line at the midpoint of 
a frequency interval at each iteration. At the end of the iteration, the frequency 
interval will be replaced by its left or right half based on the relative strength of the 
calculated spectral line and that of the two ends of the interval. The search stops once 
the frequency interval is narrow enough, and the estimated frequency of the current 
frame will be used to initialize the candidate frequency interval of the next frame. 

In Dosiek (2015), the author considers the ENF estimation problem as a fre- 
quency demodulation problem. By considering the captured power signal to be a 
carrier sinusoid of nominal ENF value modulated by a weak stochastic signal, a 
0-Hz intermediate frequency signal can be created and analyzed instead of using the 
higher frequency modulated signal. This allows the ENF to be estimated through the 
use of FM algorithms. 

In Georgios Karantaidis and Constantine Kotropoulos (2018), the authors sug- 
gested using refined periodograms as the basis for the nonparametric frequency 
estimation approach, including Welch, Blackman—Tukey, and Daniel, as well as the 
Capon method, which is a filter bank approach based on a data-dependent filter. 

In Xiaodan Lin and Xiangui Kang (2018), the authors propose to improve the 
extraction of ENF estimates in cases of low SNR by exploiting the low-rank structure 
of the ENF signal in an approach that uses robust principal component analysis to 
remove interference from speech content and background noise. Weighted linear 
prediction is then involved in extracting the ENF estimates. 


10.2.2.3 Parametric Frequency-Domain Approach 


Parametric frequency-domain ENF estimation approaches assume an explicit model 
for the signal and the underlying noise. Due to such an explicit assumption about 
the model, the estimates obtained using parametric approaches are expected to be 
more accurate than those obtained using nonparametric approaches if the model- 
ing assumption is correct (Manolakis et al. 2000). Two of the most widely used 
parametric frequency estimation methods are based on the subspace analysis of a 
signal—noise model, namely the MUlItiple SIgnal Classification (MUSIC) (Schmidt 
1986) and Estimation of Signal Parameters via Rotational Invariance Techniques 
(ESPRIT) (Roy and Kailath 1989). These methods can be used to estimate the fre- 
quency of a signal composed of P complex frequency sinusoids embedded in white 
noise. As ENF signals consist of a real sinusoid, the value of P for ENF signals is 2 
when no harmonic signals exist. 
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The MUSIC algorithm is a subspace-based approach to frequency estimation 
that relies on eigendecomposition and the properties between the signal and noise 
subspaces for sinusoidal signals with additive white noise. MUSIC makes use of the 
orthogonality between the signal and noise subspaces to compute a pseudo-spectrum 
for a signal. This pseudo-spectrum should have P frequency peaks corresponding to 
the dominant frequency components in the signal, i.e., the ENF estimates. 

The ESPRIT algorithm makes use of the rotational property between staggered 
subspaces that are invoked to produce the frequency estimates. In our case, this 
property relies on observations of the signal over two intervals of the same length 
staggered in time. ESPRIT is similar to MUSIC in the sense that they are both 
subspace-based approaches, but it is different in that it works with the signal subspace 
rather than the noise subspace. 


10.2.3 Higher Order Harmonics for ENF Estimation 


The electric power signal is subject to waveform distortion due to the presence of 
nonlinear elements in the power system. A nonlinear element takes a non-sinusoidal 
current for a sinusoidal voltage. Thus, even for a nondistorted voltage waveform, the 
current through a nonlinear element is distorted. This distorted current waveform, 
in turn, leads to a distorted voltage waveform. Though most elements of the power 
network are linear, transformers are not, especially during sustained over-voltages 
and power electronic components. Besides that, a main reason behind the distortion 
is due to nonlinear load, mainly power-electronic converters (Bollen and Gu 2006). 

A significant way in which the waveform distortion of the power signal manifests 
itself is in harmonic distortion, where the power signal waveform can be decomposed 
into a sum of harmonic components with the fundamental frequency being close to 
the 50/60 Hz nominal ENF value. It follows that scaled versions of almost the same 
variations appear in many of the harmonic bands, although the strength of the traces 
may differ at different harmonics with varying recording environments and devices 
used. An example of this can be seen in Fig. 10.3. In extracting the ENF signal, 
we can take advantage of the presence of ENF traces at multiple harmonics of the 
nominal frequency in order to make our ENF signal estimate more robust. 

In Bykhovsky and Cohen (2013), the authors extend the ENF model from a single- 
tone signal to a multi-tone harmonic one. Under this model, they use the Cramer—Rao 
bound for the estimation model that shows that the harmonic model can lead to a 
theoretical O (M 3) factor improvement in the ENF estimation accuracy, with M 
being the number of harmonics. The authors then derive a maximum likelihood 
estimator for the ENF signal, and their results show a significant gain as compared 
with the results on the single-tone model. 

In Hajj-Ahmad et al. (2013), the authors propose a spectrum combining approach 
to ENF signal estimation, which exploits the different ENF components appearing in 
a signal and strategically combines them based on the local SNR values. A hypothesis 
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Fig. 10.3 Spectrogram strips about the harmonics 60 Hz for two sets of simultaneously recorded 
power and audio measurements (Hajj-Ahmad et al. 2013) 


testing performance of an ENF-based timestamp verification application is examined, 
which validates that the proposed approach achieves a more robust and accurate 
estimate than conventional ENF approaches that rely solely on a single tone. 


10.3 ENF Characteristics and Embedding Conditions 


Despite that there has been an increasing amount of work recently geared toward 
extracting the ENF signal from media recordings and then using it for innovative 
applications, there has been relatively less work done toward understanding the con- 
ditions that promote or hinder the embedding of ENF in media recordings. Further 
work also can be done toward understanding the way the ENF behaves. In this section, 
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we will first go over the existing literature on the conditions that affect ENF capture 
and then discuss recent statistical modeling efforts on the ENF signal. 


10.3.1 Establishing Presence of ENF Traces 


Understanding the conditions that promote or hinder the capture of ENF traces in 
media recordings would go a long way in helping us benefit from the ENF signal in 
our applications. It would also help us understand better the situations in which ENF 
analysis is applicable. 

If a recording is made with a recorder connected to the electric mains power, it is 
generally accepted that ENF traces will be present in the resultant recording (Fechner 
and Kirchner 2014; Kajstura et al. 2005; Brixen 2007a; Huijbregtse and Geradts 2009; 
Cooper 2009b, 2011; Garg et al. 2012). The strength and presence of the ENF traces 
depend on the recording device’s internal circuitry and electromagnetic compatibility 
characteristics (Brixen 2008). 

If the recording is made with a battery-powered recorder, the question of whether 
or not the ENF will be captured in the recording becomes more complex. Broadly 
speaking, ENF capturing can be affected by several factors that can be divided into 
two groups: factors related to the environment in which the recording was made 
and factors related to the recording device used to make the recording. Interactions 
between different factors may as well lead to different results. For instance, elec- 
tromagnetic fields in the place of recording promote ENF capturing if the recording 
microphone is dynamic but not in the case where the recording microphone is elec- 
tret. Table 10.1 shows a sample of factors that have been studied in the literature for 
their effect on ENF capture in audio recordings (Fechner and Kirchner 2014; Brixen 
2007a; Jidong Chai et al. 2013). 

Overall, the most common cause of ENF capture in audio recordings is the acoustic 
mains hum, which can be produced by mains-powered equipment in the place of 
recording. The hypothesis that this background noise is a carrier of ENF traces was 
confirmed in Fechner and Kirchner (2014). Experiments carried out in an indoor 
setting suggested high robustness of ENF traces where the ENF traces were present 
in a recording made 10m away from a noise source located in a different room. 
Future work will still need to conduct large-scale empirical studies to infer how 
likely real-world audio recordings will contain distinctive ENF traces. 

Hajj-Ahmad et al. (2019) conducted studies exploring factors that affect the cap- 
ture of ENF traces. They demonstrated that moving a recorder while making a record- 
ing will likely compromise the quality of the ENF being captured, due to the Doppler 
effect, possibly in conjunction with other factors such as air pressure and other vibra- 
tions. They also showed that using different recorders in the same recording setting 
can lead to different strengths of ENF traces captured and at different harmonics. 
Further studies along this line will help understand better the applicability of ENF 
research and inform the design of scalable ENF-based applications. 
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Table 10.1 Sample of factors affecting ENF capture in audio recordings made by battery-powered 
recorders (Hajj-Ahmad et al. 2019) 


Factors Effect 


Environmental Electromagnetic (EM) fields | Promote ENF capture in 
recordings made by dynamic 
microphones but not in those 
made by electret microphones 


Acoustic mains hum Promotes ENF capture; 
sources include fans, power 
adaptors, lights, and fridges 


Electric cables in vicinity Not sufficient for ENF capture 


Device-related Type of microphone Different types have different 
reactions to the same sources, 
e.g., to EM fields 

Frequency band of recorders | Recorders may be incapable of 


recording low frequencies, 
e.g., around 50/60 Hz 


Internal compression by Strong compression, e.g., 
recorder Adaptive Multi-Rate, can limit 
ENF capturing 


In Zhu et al. (2018), Zhu et al. (2020), the authors propose an ENF traces presence 
test for 1-D signals. The presence of the ENF is first tested for each frame using 
a test statistic quantifying the relative energy within a small neighborhood of the 
frequency of the spectral peak. A frame can be classified as “voiced” or “unvoiced” 
by thresholding the test statistic. The algorithm merges frames of the same decision 
type into a segment while allowing frames of the other type to be sparsely presented 
in a long segment. This refinement produces a final decision result containing two 
types of segments that are sparsely interleaved over time. 

In Vatansever et al. (2017), the authors address the issue of the presence of ENF 
traces in videos. In particular, the problem they address is that ENF analysis on 
videos can be computationally expensive, and typically, there is no guarantee that 
a video would contain ENF traces prior to analysis. In their paper, they propose an 
approach to assess a video before further analysis to understand whether it contains 
ENF traces. Their ENF detection approach is based on using superpixels, which 
are steady object regions having very close reflectance properties, in a representative 
video frame. They show that their algorithm can work on video clips as short as 2 min 
and can operate independently of the camera image sensor type, i.e., complementary 
metal-oxide semiconductor (CMOS) or charge-coupled device (CCD). 
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10.3.2 Modeling ENF Behavior 


Understanding how the ENF behaves over time can be important toward understand- 
ing if a detected frequency component corresponds to the ENF or not. Several studies 
have carried out statistical modeling on the ENF variations. In Fig. 10.4, we can see 
that ENF values collected from the UK grid over one month follow a Gaussian-like 
distribution (Cooper 2009b). 

Other studies have been carried out on the North American grids, where there are 
four interconnections, namely Eastern Interconnection (EI), Western Interconnec- 
tion (WECC), Texas Interconnection (ERCOT), and a Quebec Interconnection. The 
studies show that the ENF generally follows Gaussian distributions in the Eastern 
interconnection, the Western interconnection, and the Quebec interconnection, yet 
the mean and standard deviation values are different across interconnections. For 
instance, the Western interconnection shows a smaller standard deviation than the 
Eastern interconnection indicating that it maintains a slightly tighter control over 
the ENF variations. The differences in density are reflective of the control strategies 
employed on the grids and the size of the grids (Liu et al. 2011, 2012; Garg et al. 
2012; Top et al. 2012). A further indication of the varying behaviors of ENF varia- 
tions across different grids is that a study has shown that the ENF data collected in 
Singapore do not strictly follow a Gaussian distribution (Hua 2014). 

In Garg et al. (2012), the authors have shown that the ENF signal in the North 
American Eastern interconnection can be modeled as a piecewise wide sense sta- 
tionary (WSS) signal. Following that, they model the ENF signal as a piecewise 
autoregressive (AR) process, and show how this added understanding of the ENF 
signal behavior can be an asset toward improving performance in ENF applications, 
such as with time—location authentication, which will be discussed in Sect. 10.5.1. 


Fig. 10.4 Histogram of ENF x 10° 
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10.4 ENF Traces in the Visual Track 


In Garg et al. (2011), the authors showed for the first time that the ENF signal can 
be sensed and extracted from the light illumination using a customized photodiode 
circuit. In this section, we explain how the ENF signal can be extracted from video 
and image signals captured by digital cameras. First, we describe each stage of the 
pipeline that an ENF signal goes through and is processed. Different types of imaging 
sensors, namely CCD and CMOS, will be discussed. Second, we explain the steps 
to extract the ENF signal. We start from simple videos with white-wall scenes and 
then move to more complex videos with camera motion, foreground motion, and 
brightness change. 


10.4.1 Mechanism of ENF Embedding in Videos and Images 


The intensity fluctuation in visual recordings is caused by the alternating pattern of the 
supply current/voltage. We define ENF embedding as the process of adding intensity 
fluctuations caused by the ENF to visual recordings. The steps of the ENF embedding 
are illustrated in Fig. 10.5. The process starts by converting the AC voltage/current 
into a slightly flickering light signal. The light then travels through the air with its 
energy attenuated. When it arrives at an object, the flickering light interacts with the 
surface of the object, producing a reflected light that flickers at the same speed. The 
light continues to travel through the air and the lens and goes through another round 
of attenuation before it arrives at the imaging sensor of a camera or the retina of a 
human. Through a short temporal accumulation process that is lowpass in nature, 
a final sensed signal will contain the flickering component in addition to the pure 
visual signal. We describe each stage of ENF embedding in Sect. 10.4.1.1. 

Depending on whether a camera’s image sensor type is CCD or CMOS, the flick- 
ering due to the ENF will be captured at the frame level or the row level, respectively. 
The latter scenario has an equivalent sampling rate that is hundreds or thousands of 
times that of the former scenario. We provide more details about CCD and CMOS 
in Sect. 10.4.1.2. 


10.4.1.1 Physical Embedding Processes 


Given the supply voltage 


v(t) = A(t) cos (2x f f(t)dt + 6») (10.1) 
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Fig. 10.5 Illustration of the ENF embedding process for visual signals. The supply voltage v(t) 
with frequency f(t) is first converted into the visual light signal, which has a DC component and 
an AC component flickering at 2 f (t) Hz. The visual light then reaches an object and interacts 
with the object’s surface. The reflected light emitted from the object arrives at a camera’s imaging 
sensor at which photons are collected for a duration of the exposure time, resulting in a sampled 
and quantized temporal visual signal whose intensity fluctuates at 2 f(t) Hz 


with a time-varying voltage amplitude A(t), a time-varying frequency f(t) Hz, and 
a random initial phase ġo, it follows from the power law that the frequency of supply 
power is double that the supply voltage,! namely 


P(t) = u2(t)/ R = A2(t) cos” (= Í f(t)dt + 6») /R (10.2a) 
A°(t) t 
= feos (= f [2f(t)]dt + 2%) + '| . _(10.2b) 


Note that for the nonnegative supply power P (t), the ratio of the amplitude of the AC 
component to the strength of the DC is equal to one. The voltage to power conversion 
is illustrated by the first block of Fig. 10.5. 

Next, the supply power in the electrical form is converted into the electromagnetic 
wave by a lighting device. The conversion process equivalently applies a lowpass filter 
to attenuate the relative strength of the AC component based on the optoelectronic 
principles of the fluorescent or incandescent light. The resulting visual light contains 
a stable DC component and a relatively smaller flickering AC component with a 


' We simplify the above analysis by assuming that the load is purely resistive but the result applies 
to inductive and capacitive loads as well. 
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time-varying frequency 2 f (t). The power to light conversion is illustrated by the 
second block of Fig. 10.5. 

A ray of the visual light then travels to an object and interacts with the object’s 
surface to produce reflected or reemitted light that can be picked up by the eyes of 
an observer or a camera’s imaging system. The overall effect is a linear scaling to 
the magnitude of the visual light signal, which preserves its flickering component 
for the light arriving at the lens. The third block of Fig. 10.5 illustrates a geometric 
setup where a light source shines on an object, and a camera acquires the reflected 
light. Given a point p on the object, the intensity of reflected light arriving at the 
camera is dependent on the light—objective distance d(p) and the camera—object 
distance d’ (p) per the inverse-square law that light intensity is inversely proportional 
to the squared distance. The intensity of the reflected light is also determined by the 
reflection characteristics of the object at p. The reflection characteristics can be 
broadly summarized into the diffuse reflection and the specular reflection, which 
are widely adopted in computer vision and computer graphics for understanding 
and modeling everyday vision tasks (Szeliski 2010). The impact of reflection on the 
intensity of the reflected light is a combined effect of albedo, the direction of the 
incident light, the orientation of the object’s surface, and the direction of the camera. 
From the camera’s perspective, the light intensities at different locations of the object 
form an image of the object (Szeliski 2010). Adding a time dimension, the intensities 
of all locations fluctuate synchronously at the speed of 2 f (t) Hz due to the changing 
appearance caused by the flickering light. 

The incoming light is further sampled and quantized by the camera’s sensing unit 
to create digital images or videos. Through the imaging sensor, the photons of the 
incoming light reflected from the scene are converted into electrons and subsequently 
into voltage levels that represent the intensity levels of pixels of a digital image. 
Photons are collected for a duration of the exposure time to accumulate a sizable mass 
to clearly depict the scene. Hajj-Ahmad et al. (2016) show that the accumulation of 
photons can be viewed as convolving a rectangular window to the perceived light 
signal arriving at the lens as illustrated in the last block of Fig. 10.5. This is a second 
lowpass filtering process that further reduces the strength of the AC component 
relative to the DC. 


10.4.1.2 Rolling Versus Global Shutter 


When images and videos are digitized by cameras, depending on whether the cam- 
era uses a global-shutter or rolling-shutter sensing mechanism, the ENF signal will 
be embedded into the visual data in different ways. CCD cameras usually capture 
images using global shutters. It exposes and reads out all pixels of an image/frame 
simultaneously, hence the ENF is embedded by capturing the global intensity of 
the image. In contrast, CMOS cameras usually capture images using rolling shut- 
ters (Jinwei Gu et al. 2010). A rolling shutter exposes and reads out only one row of 
pixels at a time, hence the ENF is embedded by sequentially capturing the intensity of 
every row of an image/frame. The rolling shutter digitizes rows of each image/frame 
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Fig. 10.6 (Left) Rolling-shutter sampling time diagram of a CMOS camera: Rows of each frame 
are sequentially exposed and read out, followed by an idle period before proceeding to the next 
frame (Su et al. 2014a). (Right) The row signal of a white-wall video generated by averaging the 
pixel intensities of each row and concatenating the averaged values for all frames (Su et al. 2014b). 
The discontinuities are caused by idle periods 


sequentially, making it possible to sense the ENF signal much faster than using the 
global shutter—the effective sampling rate is scaled up by a multiplicative factor 
that equals the number of rows of sensed images, which is usually on the order of 
hundreds or thousands. The left half of Fig. 10.6 shows a timing diagram of when 
each row and each frame are acquired using a rolling shutter. Rows of each frame are 
sampled uniformly in time, followed by an idle period before proceeding to the next 
frame (Su et al. 2014a). The right half of Fig. 10.6 shows a row signal of a white-wall 
video (Su et al. 2014b), which can be used for frequency estimation/tracking. The 
row Signal is generated by averaging the pixel intensities of each row and concatenat- 
ing the averaged values for all frames (Su et al. 2014b). Note that the discontinuities 
are caused by the missing values during the idle periods. 


10.4.2 ENF Extraction from the Visual Track 


ENF signals can be embedded in both global-shutter and rolling-shutter videos as 
explained in Sect. 10.4.1.2. However, being able to successfully extract an ENF signal 
from a global-shutter video needs a careful selection of a camera’s frame rate. This 
is because global-shutter videos may suffer from frequency contamination due to 
aliasing caused by an insufficient sampling rate normally ranging from 24 to 30 fps. 
Two typical failure examples are using a 25-fps camera in an environment with light 
flickering 100 Hz or using a 30-fps camera with 120 Hz light. We will detail this 
challenge in Sect. 10.4.2.1. 

The rolling shutter, traditionally considered to be detrimental to image qualities 
(Jinwei Gu et al. 2010), can sample the flickering of the light signal hundreds or 
thousands of times faster than the global shutter. A faster sampling rate eliminates 
the need to worry about the ENF’s potential contamination due to aliasing. 
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We will explain the steps to extract the ENF signal starting from simple videos with 
no visual content in Sect. 10.4.2.2 to complex videos with camera motion, foreground 
motion, and brightness change in Sect. 10.4.2.3. 


10.4.2.1 Challenges of Using Global-Shutter Videos 


In Garg et al. (2013), the authors used a CCD camera to capture white-wall videos 
under indoor lighting to demonstrate the feasibility of extracting ENF signals from 
the visual track. The white-wall scene can be considered to contain no visual content 
except for an intensity bias, hence it can be used to demonstrate the essential steps 
to extract ENF from videos. The authors took the average of the pixel values in each 
H-by-W frame of the video and obtained a 1-D sinusoid-like time signal Sframe(t) 
for frequency estimation, namely 


Sframe(t) = I(x, y, t), t = 0, Peer (10.3) 


where H is the number of rows, W is the number of columns, Z is the video intensity, 
and x, y, and t are its row, column, and time indices, respectively. Here, the subscript 
“frame” of the signal symbol implies that its sampling is conducted at the frame level. 
Figure 10.7(a) and (b) shows the spectrograms calculated from the power signal and 
the time signal Sframe(t) of frame averaged intensity. It is revealed that the ENF 
estimated from the video has the same trend as that from the power mains and has 
doubled dynamic range, confirming the feasibility of ENF extraction from videos. 
One major challenge in using a global-shutter-based CCD camera for ENF esti- 
mation is the aliasing effect caused by the insufficient sampling rate. Most consumer 
digital cameras adopt a frame rate of around 30 fps, while the ENF signal and its har- 
monics appear at integer multiples of 100 120 Hz. The ENF signal, therefore, suffers 
from a severe aliasing effect due to an insufficient sampling rate. In Garg et al. (2013), 
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Fig. 10.7 Spectrograms of fluctuating ENF measured from a power mains, and b frames of a 
white-wall video recording (Garg et al. 2013) 
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Fig. 10.8 Issue of mixed frequency components caused by aliasing when global-shutter CCD 
cameras are used. (Left) Time-frequency representation of the original signal containing substantial 
signal contents around +120 Hz and DC (0 Hz); (Right) Mixed/aliased frequency components 
around DC after being sampled at 30 fps. Two mirrored components, in general, cannot be separated 
once mixed. Their overlap with the DC component further hinders the estimation of the desired 
frequency component 


the illumination was 100 Hz and the camera’s sampling rate was at 30 fps, resulting 
in an aliased component centered 10 Hz as illustrated in Fig. 10.7(b). Such a combi- 
nation of ENF nominal frequency and the CCD camera’s frame rate does not affect 
the ENF extraction. However, the 30-fps sampling rate can cause major difficulties 
in ENF estimation for global-shutter videos captured in 120-Hz ENF countries such 
as the US. We point out two issues using the illustrations in Fig. 10.8. First, two mir- 
rored frequency components cannot be easily separated once they are mixed. Since 
the power signal is real-valued, a minored —120 Hz component also exists in its 
frequency domain as illustrated in the left half of Fig. 10.8. When the frame rate is 
30 Hz, both +120 Hz components will be aliased to 0 Hz upon sampling, creating 
a symmetrically overlapping pattern at 0 Hz as shown in the right half of Fig. 10.8. 
Once the two desired frequency components are mixed, they cannot, in general, be 
separated without ambiguity hence making the ENF extraction impossible in most 
cases. Second, to make the matter worse, the native DC content around 0 Hz may 
further distort the mirrored and aliased ENF components (Garg et al. 2013), which 
may further hinder the estimation of the desired frequency component. 


10.4.2.2 Rolling-Shutter Videos with No Visual Content 


To address the challenge of insufficient sampling rate, Garg et al. (2013); Su et al. 
(2014a); Choi and Wong (2019) exploited the fact that a rolling shutter acquires rows 
of each frame in a sequential manner, which effectively upscales the sampling rate 
by a multiplicative factor of the number of rows in a frame. In the rolling-shutter 
scenario, the authors again use a white-wall video to demonstrate the essential steps 
of extracting ENF traces. As discussed in Sect. 1.4.1.2, the ENF is embedded by 
sequentially affecting the row intensities. To extract the ENF traces, it is intuitive to 
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average the similarly affected Intensity values within each row to produce one value 
indicating the impact of ENF at the timestamp at which the row is exposed. More 
precisely, after averaging, the resulting “video” data is indexed only by row x and 
time t: 


w-l 


1 
hos, D) = oF D1 1G,y,D, x=0,...,.#-1,t=0,1,..... (104) 
y=0 


Su et al. (2014a) define a 1-D row signal by concatenating J, (x, t) along time, 
namely 
Srow (n) = Trow(n mod H, floor(n/H)), n=0,1,... (10.5) 


Using a similar naming convention as in (10.3), the subscript in Srow (7n) implies that its 
sampling is equivalently conducted at the row level, which is hundreds or thousands 
of times faster than the frame rate. This allows the frequency estimation of ENF to 
be conducted on a signal of a much higher rate without suffering from potentially 
mixed signals due to aliasing. 

The multi-rate signal analysis (Parishwad 2006) is used in Su et al. (2014a) to 
analyze the concatenated signal (10.5) and shows that such direct concatenation by 
ignoring a frame’s idle period can result in slight distortion to the estimated ENF 
traces. To avoid such distortion, Choi and Wong (2019) show that the signal must 
be concatenated as if it were sampled uniformly in time. That is, the missing sample 
points due to the idle period need to be filled in with zeros before the concatenation. 
This zero-padding approach can produce undistorted ENF traces but requires the 
knowledge of the duration of the idle period ahead of time. The idle period is related 
to acamera-model specific parameter named read-out time (Hajj-Ahmad et al. 2016), 
which will be discussed in Sect. 10.5.4. The authors in Vatansever et al. (2019) further 
illustrate how the frequency of the main ENF harmonic is replaced with new ENF 
components depending on the length of the idle period. Their model reveals that the 
power of the captured ENF signal is inversely proportional to the idle period length. 


10.4.2.3 Rolling-Shutter Videos with Complex Visual Content 


In practical scenarios, an ENF signal in the form of light intensity variation coexists 
with the nontrivial video content reflecting camera motion, foreground motion, and 
brightness change. To be able to extract an ENF signal from such videos, the general 
principle is to construct a reasonable estimator for the video content V(x, y, t) and 
then subtract it from the original video J (x, y, t) in order to single out the light 
intensity variation due to ENF (Su et al. 2014a, b). The resulting ENF-only residual 
video E(x, y,t) = I(x, y, t) — Vix, y, t) can then be used for ENF extraction by 
following the procedure laid out in Sect. 10.4.2.2. A conceptual diagram is shown in 
Fig. 10.9. Below, we formalize the procedure to generate an ENF-only residual video 
using a static-scene example. The procedure is also applicable to videos with more 
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Fig. 10.9 A schematic of how ENF can be extracted from rolling-shutter videos 


complex scenes once motion estimation and compensation are conducted to estimate 
the visual content V(x, y, t). We will also explain how camera motion, foreground 
motion, and brightness change can be addressed. 


ENF-Only Residual Video Generation 
Based on the ENF embedding mechanism discussed in Sect. 10.4.1.2, we formulate 
an additive ENF embedding model? for rolling-shutter captured video as follows: 


T(x, y,t) = V(x, y, t) + E(x, t), (10.6) 


where V (x, y, t) is the visual content and E (x, t) is the ENF component that depends 
on row index x and time index t. For a fixed t, the ENF component E(x, t) is a 
sinusoid-like signal. 

Our goal is to estimate E(x, t) given a rolling-shutter captured video J (x, y, t) 
modeled by (10.6). Once we obtain the estimate E (x, t), one can choose to use the 
direct concatenation (Su et al. 2014a) or the periodic zero padding (Choi and Wong 
2019) to generate a 1-D time signal for frequency estimation. 

Dangling modifier. An intuitive approach to estimate E(x, t), is to first obtain 
a reasonable estimate Vix, y, t) of the visual content and subtract it from the raw 
video I(x, y, t), namely 


E(x, y,t) = 1(x, y,t) — V(x, y, t). (10.7) 


? Although a multiplicative model may better describe the optophysical process of how ENF is 
modulated into the reflected light, the multiplicative model is not very analytically tractable for 
ENF analysis. The additive model is also more correct when the AC component in the digitized 
image is relatively weaker than the DC component as pointed out in Sect. 10.4.1.1. 
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For a static scene video, since the visual contents of every frame are identical, it 
is intuitive to obtain an estimator by taking the average of a random subset 7 c 
{0, 1, 2,...} of video frames 


° I 
Væ, y, t= 10y. (10.8) 
teT 


Such a random average can help cancel out the ENF component in (10.6). Once 
E(x, y, t) is obtained, the same procedure of averaging over the column index (Su 
et al. 2014a) as in (10.4) is conducted to estimate E(x, t), as the pixels of the same 
row in Ê (x, y, t) are perturbed by the ENF in the same way: 


w-l 
Ë (x, t) = = >` E(x, y, t) (10.9a) 
w y=0 
1 w-l 
= E(x,t)+ T ae y,t) — V(x, y, t)]. (10.9b) 


When > V (x, y, t) is a good estimator for ka >; V(x, y, t) for all (x, t), the 


second term is close to zero, making É (x, t) a good estimator for E (x, t). To extract 
the ENF signal, E(x, t) is then vectorized into a 1-D time signal as in (10.5), namely 


Srow (n) = E(n mod H, floor(n/H)), n=0,1,... (10.10) 
Camera Motion 


For videos with more complex scenes, appropriate video processing tools need to be 
used to generate the estimated visual content V(x, y, t). We first consider the class 
of videos containing merely camera motions such as panning, rotation, zooming, and 
shaking. Two frames that are not far apart in time can be regarded as being generated 
by almost identical scenes projected to the image plane with some offset in the image 
coordinates. The relationship of the two frames can be established by finding each 
pixel of the frame of interest at to and its new location in a frame at tı, namely 


I(x + dx(x, y, ti), y + dy(x, y, ti), to) x I(x, y, tı), (10.11) 


where (dx, dy) is the motion vector associated with the frame of interest pointing 
to the new location in the frame at tı. Motion estimation and compensation are 
carried out in Su et al. (2014a), Su et al. (2014b) using optical flow to obtain a set of 
estimated frames L(x, y, to) for the frame of interest I(x, y, to). As the ENF signal 
has a relatively small magnitude compared to the visual content and the motion 
compensation procedure often introduces noise, an average over all these motion- 
compensated frames, namely 
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should lead to a good estimation of the visual content per the law of large numbers. 


Foreground Motion 

We now consider the class of videos with foreground motions only. In this case, 
the authors in Su et al. (2014a), Su et al. (2014b) use a video’s static regions to 
estimate the ENF signal. Given two image frames, the regions that are not affected 
by foreground motion in both frames are detected by thresholding the pixel-wise 
differences. The row averages of the static regions are then calculated to generate a 
1-D time signal for ENF estimation. 


Brightness Compensation 
Many cameras are equipped with an automatic brightness control unit that would 
adjust a camera’s sensitivity in response to the changes of the global or local light 
illumination. For example, as a person in a bright background moves closer to the 
camera, the background part of the image can appear brighter when the control unit 
relies on the overall intensity of the frame to adjust the sensitivity (Su et al. 2014b). 
Such a brightness change can bias the estimated visual content and lead to a failure 
in the content elimination process. 

Su et al. (2014b) found that the intensity values of two consecutive frames roughly 
follow the linear relationship 


I(x, y,t) © Bi@I@,y,t+ 1) + Bot), Yæ, y), (10.13) 


where 6; (t) is a slope and Ao (t) is an intercept for the frame pair (I (x,y, t), I(x, y, 
t+ 1). By estimating the parameters, background intensity values of different 
frames can be matched to the same level, which allows a precise visual content 
estimation and elimination. 


10.4.3 ENF Extraction from a Single Image 


In an extension to ENF extraction from videos, Wong et al. (2018) showed that 
ENF can even be extracted from a single image taken by a camera with a rolling 
shutter. As with rolling-shutter videos, each row of an image is sequentially acquired 
and contains the temporal flickering component due to the ENF. The acquisition of 
all rows within an image happens during the frame period that is usually 1/30 or 
1/25 based on the common frame rates, which is too short to extract a meaningful 
temporal ENF signal for most ENF-based analysis tasks. However, an easier binary 
classification problem can be answered (Wong et al. 2018): Is it possible to tell 
whether the capturing geographic region of an image has 50 Hz or 60 Hz supply 
frequency? 
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There are severalunsolved research challenges in ENF from images. For example, 
the proposed ENF classifier in Wong et al. (2018) works well for images with syn- 
thetically added ENF, but its performance on real images may be further improved by 
incorporating a more sophisticated physical embedding model. Moreover, instead of 
making a binary decision, one may construct a ternary classifier to test whether ENF 
is present in a given image. Lastly, one may take advantage of a few frames captured 
in a continuous shooting mode to estimate the rough shape of the ENF signal for 
authentication purposes. 


10.5 Key Applications in Forensics and Security 


The ENF signal has been shown to be a useful tool in solving several problems that 
are faced in digital forensics and security. In this section, we highlight major ENF- 
based forensic applications discussed in the ENF literature in recent years. We start 
with joint time—location authentication, and proceed to discuss ENF-based tampering 
detection, ENF-based localization of media signals, and ENF-based approaches for 
camera forensics. 


10.5.1 Joint Time—Location Authentication 


The earliest works on ENF-based forensic applications have focused on ENF-based 
time-of-recording authentication of audio signals (Grigoras 2005, 2007; Cooper 
2008; Kajstura et al. 2005; Brixen 2007a; Sanders 2008). The ENF pattern extracted 
from an audio signal should be very similar to the ENF pattern extracted from a power 
reference recording simultaneously recorded. So, if a power reference recording is 
available from a claimed time-of-recording of an audio signal, the ENF pattern can 
be extracted from the audio signal and compared against the ENF pattern from the 
power recording. If the extracted audio ENF pattern is similar to the reference pattern, 
then the audio recording’s claimed timestamp is deemed authentic. 

To discern the measure of similarity between the two ENF patterns, the minimum 
mean squared error (MMSE) or Pearson’s correlation coefficient can be used. Certain 
modifications to these matching criteria have been proposed in the literature. 

One such modification to the matching criteria was proposed in Garg et al. (2012) 
where the authors exploit their findings that the US ENF signal can be modeled 
as a piecewise linear AR process to achieve better matching in ENF-based time-of- 
recording estimation. In this work, the authors extract the innovation signals resulting 
from the AR modeling of the ENF signals, and use these innovation signals for 
matching instead of the original ENF signals. Experiments done under a hypothesis 
detection framework show that this approach provides higher confidence in time-of- 
recording estimation and verification. 
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Other matching approaches have been proposed as well. For instance, in Hua 
(2014), the authors proposed a new threshold-based dynamic matching algorithm, 
termed the error correction matching algorithm (ECM), to carry out the matching. 
Their approach accounts for noise affecting ENF estimates due to limited frequency 
resolution. In their paper, they illustrate the advantages of their proposed approach 
using both synthetic and real signals. The performance of this ECM approach, 
along with a simplified version of it termed the bitwise similarity matching (BSM) 
approach, is compared against the conventional MMSE matching criterion in Hua 
(2018). The authors find that due to the complexity of practical situations, the per- 
formance of the ECM and BSM approaches over the benchmark MMSE approach 
may not be guaranteed. However, in the situations examined in the paper, the find- 
ing is that ECM results in the most accurate matching results while BSM sacrifices 
matching accuracy for processing speed. 

In Vatansever et al. (2019), the authors propose a time-of-recording authentication 
approach tailored to ENF signals extracted from videos taken with cameras using 
rolling shutters. As mentioned in Sect. 10.4.1.2, such cameras typically have an idle 
period between each frame, thus resulting in missed ENF samples in the resultant 
video. In Vatansever et al. (2019), and based on multiple idle period assumptions, 
missing illumination periods in each frame are interpolated to compensate for the 
missing samples. Each interpolated time series then yields an ENF signal that can be 
matched to the ground-truth ENF reference through correlation coefficients to find 
or verify the time-of-recording. 

An example of a real-life case where ENF-based time-of-recording authentication 
was used was described in Kajstura et al. (2005). In 2003, the Institute of Forensic 
Research in Cracow, Poland, was asked to investigate a 55-min long recording of a 
conversation between two businessmen made using a Sony portable digital recorder 
(model: ICD-MS515). The time of the creation of the recording indicated by the 
evidential recorder differed by close to 196 days from the time reported by witnesses 
of the conversation. Analysis of the audio recording revealed the presence of ENF 
traces. The ENF signal extracted from the audio recording was compared against 
reference ENF signals provided by the power grid operator, and it was revealed that 
the true time of the creation of the recording was the one reported by the witnesses. 
In this case, the date/time setting on the evidential recorder at the time of recording 
was incorrect. 

A broader way to look at this particular ENF forensics application is not just 
as a time-of-recording authentication but as a joint time—location authentication. 
Consider the case where a forensic analyst is asked to verify, or discover, the correct 
time-of-recording and the location-of-recording of a media recording. In addition 
to authenticating the time-of-recording, the analyst may as well be able to identify 
the location-of-recording on a power grid level. Provided the analyst has access to 
ENF reference data from candidate times-of-recording and candidate grids-of-origin, 
comparing the ENF pattern extracted from the media recording to the reference ENF 
patterns extracted from candidate time/location ENF patterns should point to the 
correct time-of-recording and grid-of-origin of the media recording. 
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This application was the first that piqued the interest of the forensics commu- 
nity in the ENF signal. Yet, wide application in practical settings still faces several 
challenges. 

First, the previous discussion has assumed that the signal has not been tampered 
with. If it had been tampered with, the previously described approach may not be 
successful. Section 10.5.2 describes approaches to verify the integrity of a media 
recording using its embedded ENF. 

Second, depending on the initial information available about the recording, 
exhaustively matching the extracted media ENF pattern against all possible ENF pat- 
terns may be too expensive. An example of a recent work addressing this challenge is 
Pop et al. (2017), where the authors proposed computing a bit sequence encoding the 
trend of an ENF signal when it is added into the ENF database. When presented with 
a query ENF signal to be timestamped, its trend bit sequence is computed and com- 
pared to the reference trend bit sequences as a way to prune its candidate timestamps, 
and reduce the final number of reference ENF signals it would need to be compared 
against using conventional approaches such as MMSE and correlation coefficient. 

Third, this application assumes the availability of reference ENF data from the 
time and location of the media recording under study. In the case where reference 
data is incomplete or unavailable, a forensic analyst will need to resort to other 
measures to carry out the analysis. In Sect. 10.5.3, one such approach is described, 
which infers the grid-of-origin of a media recording without the need for concurrent 
reference ENF data. 


10.5.2 Integrity Authentication 


Audio forgery techniques can be used to conduct piracy over the Internet, falsify 
court evidence, or modify security device recordings or recordings of events taking 
place in different parts of the world (Gupta et al. 2012). This is especially relevant 
today with the prevalent use of social media by people who add on to news coverage 
of global events by sharing their own videos and recordings of what they experienced. 

In some cases, it is instrumental to be able to ascertain whether recordings are 
authentic or not. Transactions involving large sums of money, for example, can take 
place over the phone. In such cases, it is possible to change the numbers spoken in 
a previously recorded conversation and then replay the message. If the transaction 
is later challenged in court, the forged phone conversation can be used as an alibi 
by the accused to argue that the account owner did authorize the transaction (Gupta 
et al. 2012). This would be a case where methods to detect manipulation in audio 
recordings are necessary. In this section, we discuss the efforts made to use the 
captured ENF signal in audio recordings for integrity authentication. 

The basic idea of ENF-based authentication is that if an ENF-containing record- 
ing has been tampered with, the changes made will affect the extracted ENF signal 
as well. Examples of tampering include removing parts of the recording and insert- 
ing parts that do not belong. Examining the ENF of a tampered signal would reveal 
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Fig. 10.11 Audio signals from Brazil bandpassed around the 60 Hz value, a filtered original signal, 
and b filtered edited signal (Nicolalde and Apolinario 2009) 


discontinuities in the extracted ENF signal that raise the question of tampering. The 
easiest scenario in which to check for tampering would be the case where we have 
reference ENF available. In such a case, the ENF signal can be compared to the ref- 
erence ENF signal from the recording’s claimed time and location. This comparison 
would either support or question the integrity of the recording. Figure 10.10 shows 
an example where comparing an ENF signal extracted from a video recording with 
its reference ENF signal reveals the insertion of a foreign video piece. 
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The case of identifying tampering in ENF-containing recordings with no reference 
ENF data becomes more complicated. The approaches studied in the literature have 
focused on detecting ENF tampering based on ENF phase and/or magnitude changes. 
Generally speaking, when an ENF-containing recording has been tampered with, it 
is highly likely to find discontinuities in the phase of the ENF at regions where the 
tampering took place. 

In Nicolalde and Apolinario (2009), the authors noticed that phase changes in the 
embedded ENF signal result in a modulation effect when the recording containing 
the ENF traces is bandpassed about the nominal ENF band. An example of this is 
shown in Fig. 10.11, where there are visible differences in the relative amplitude in 
an edited signal versus in its original version. Here, the decrease in amplitude occurs 
at locations where the authors had introduced edits into the audio signal. 

Plotting the phase of an ENF recording will also reveal changes indicating tam- 
pering. In Rodriguez et al. (2010), the authors describe their approach to compute the 
phase that goes as follows. First, a signal is downsampled and then passed through 
a bandpass filter centered around the nominal ENF value. If no ENF traces are 
found at the nominal ENF, due to attacks aimed at avoiding forensics analysis, for 
instance (Chuang et al. 2012; Chuang 2013), bands around the higher harmonics of 
the nominal ENF can be used instead (Rodriguez 2013). Afterward, the filtered signal 
is divided into overlapping frames of Nc cycles of the nominal ENF, and the phase 
of each segmented frame is computed using DFT or using a high-precision Fourier 
analysis method called DFT! (Desainte-Catherine and Marchand 2000). When the 
phase of the signal is plotted, one can not only visually inspect for tampering but can 
also glean insights as to whether the editing was the result of a fragment deletion or 
of a fragment insertion, as can be seen in Fig. 10.12. 

Rodriguez et al. (2010) also proposed an automatic approach for discriminat- 
ing between an edited signal and an unedited one, which follows from the phase 
estimation just described. The approach depends on a statistic F defined as 


N 
F = 100log (= == > ó(n) — mol , (10.14) 
n=2 


where N is the number of frames used for phase estimation, ó (n) denotes the esti- 
mate phase of frame n, and mg is the average of the computed phases. The process 
of determining whether or not an ENF-containing recording is authentic is formu- 
lated under a detection framework with the null hypothesis Hp and the alternative 
hypothesis H, denoting an original signal and an edited signal, respectively. If F is 
greater than a threshold y, then the alternative hypothesis H, is favored; otherwise, 
the null hypothesis Ho is favored. 

For optimal detection, the goal is to obtain a value for the threshold y that max- 
imizes the value of the probability of detection Pp = Pr (F > y | Hı). To do so, 
the authors in Rodriguez et al. (2010) prepare a corpus of audio signals, including 
original and edited signals, and evaluate the database with the proposed automatic 
approach with a range of y values. The value chosen for y is the one it takes at the 
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Fig. 10.12 Phase estimated using DFT from edited Spanish audio signals where the edits were a 
fragment deletion and b fragment insertion (Rodriguez et al. 2010) 


equal error rate (EER) point where the probability of miss Py = 1 — Pp is equal 
to the probability of false alarm Ppa = Pr (F > y | Ho). Experiments carried out 
on the Spanish databases AHUMADA and GAUDI resulted in an EER of 6%, and 
experiments carried out on Brazilian databases Carioca 1 and Carioca 2 resulted in 
an EER of 7%. 

Esquef et al. (2014), the authors proposed a novel edit detection method for 
forensic audio analysis that includes modification on their approach in Rodriguez 
et al. (2010). In this version of the approach, the detection criterion is based on 
unlikely variations of the ENF magnitude. A data-driven magnitude threshold is 
chosen in a hypothesis framework similar to the one in Rodriguez et al. (2010). The 
authors conduct a qualitative evaluation of the influences of the edit duration and 
location as well as noise contamination on the detection ability. This new approach 
achieved a 4% EER on the Carioca 1 database versus 7% EER on the same database 
for the previous approach of Rodriguez et al. (2010). The authors report that their 
results indicate that amplitude clipping and additive broadband noise severely affect 
the performance of the proposed method, and further research is needed to improve 
the detection performance in more challenging scenarios. 

The authors further extended their work in Esquef et al. (2015). Here, they modi- 
fied the detection criteria by taking advantage of the typical pattern of ENF variations 
elicited by audio edits. In addition to the threshold-based detection strategy of Esquef 
et al. (2014), a verification of the pattern of the anomalous ENF variations is carried 


264 A. Hajj-Ahmad et al. 


Start (User Inputs) 
Search Scope 


Audio Recording ENF ENF Reference 
Extraction Module Database Cloud 

Authentication Modul Reference ENF 
PDE Re OT ORD Extraction Module 


Tampered? 


No 
Timestamp Info. Tampered Region 
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out, making the edit detector less prone to false positives. The authors confronted 
this newly proposed approach with that of Esquef et al. (2014) and demonstrated 
experimentally that the new approach is more reliable as it tends to yield lower EERs 
such as a reduction from 4% EER to 2% EER on the Carioca | database. Experi- 
ments carried out on speech databases degraded with broadband noise also support 
the claim that the modified approach can achieve better results than the original one. 

Other groups have also addressed and extended the explorations into ENF-based 
integrity authentication and tampering detection. We mention some highlights in 
what follows. 

In Fuentes et al. (2016), the authors proposed a phase locked loop (PLL)-based 
method for determining audio authenticity. They use a voltage-controlled oscillator 
(VCO) in a PLL configuration that produces a synthetic signal similar to that of a 
preprocessed ENF-containing signal. Some corrections are made to the VCO signal 
to make it closer to the ENF, but if the ENF has strong phase variations, there will 
remain large differences between the ENF signal and the VCO signal, signifying 
tampering. An automatic threshold-based decision on the audio authenticity is made 
by quantifying the frequency variations of the VCO signal. The experimental results 
shown in this work show that the performance of the proposed approach is on par 
with that of previous works (Rodriguez et al. 2010; Esquef et al. 2014), achieving, 
for instance, 2% EER on the Carioca 1 database. 
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In Huaetal. (2016), the authors present a solution to the ENF-based audio authen- 
tication system that jointly performs timestamp verification (if tampering is not 
detected) and detection of tampering type and tampering region (if tampering is 
detected). A high-level description of this system is shown in Fig. 10.13. The authen- 
tication tool here is an absolute error map (AEM) between the ENF signal under 
study and reference ENF signals from a database. The AEM is quantitatively a matrix 
containing all the raw information of absolute errors where the extracted ENF signal 
is matched to each possible shift of a reference signal. The AEM-based solutions 
rely coherently on the authentic and trustworthy portions of the ENF signal under 
study to make their decisions on authenticity. If the signal is deemed authentic, the 
system returns the timestamp at which a match is found. Otherwise, the authors pro- 
pose two variant approaches that analyze the AEM to find the tampering regions and 
characterize the tampering type (insertion, deletion, or splicing). The authors frame 
their work here as a proof-of-concept study, demonstrating the effectiveness of their 
proposal by synthetic performance analysis and experimental results. 

In Reis et al. (2016), Reis et al. (2017), the authors propose their ESPRIT-Hilbert- 
based tampering detection scheme with SVM classifier (SPHINS) framework for 
tampering detection, which uses an ESPRIT-Hilbert ENF estimator in conjunction 
with an outlier detector based on the sample kurtosis of the estimated ENF. The 
computed kurtosis values are vectorized and applied to an SVM classifier to indicate 
the presence of tampering. They report a4% EER performance on the clean Carioca 1 
database and that their proposed approach gives improved results, as compared to 
those in Rodriguez et al. (2010); Esquef et al. (2014), for low SNR regimes and in 
scenarios with nonlinear digital saturation, when applied to the Carioca 1 database. 

In Lin and Kang (2017), the authors propose an approach where they apply a 
wavelet filter to an extracted ENF signal to reveal the detailed ENF fluctuations 
and then carry out autoregressive (AR) modeling on the resulting signal. The AR 
coefficients are used as input features for an SVM system for tampering detection. 
They report that, as compared to Esquef et al. (2015), their approach can achieve 
improved performance in noisy conditions and can provide robustness against MP3 
compression. 

In the same vein as ENF-based integrity authentication, Su et al. (2013); Lin et al. 
(2016) address the issue of recaptured audio signals. The authors in Su et al. (2013) 
demonstrate that recaptured recordings may contain two sets of ENF traces, one from 
the original time of the recording and the other due to the recapturing process. They 
tackle the more challenging scenario that the two traces overlap in frequency by 
proposing a decorrelation-based algorithm to extract the ENF traces with the help of 
the power reference signal. Lin et al. (2016) employ a convolutional neural network 
(CNN) to decide if a certain recording is recaptured or original. Their deep neural 
network relies on spectral features based on the nominal ENF and its harmonics. 
Their paper goes into further depth, examining the effect of the analysis window on 
the approach’s performance and visualizing the intermediate feature maps to gain 
insight on what the CNN learns and how it makes its decisions. 
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10.5.3 ENF-Based Localization 


The ENF traces captured by audio and video signals can be used to infer information 
about the location in which the media recording was taken. In this section, we discuss 
approaches to use the ENF to do inter-grid localization, which entails inferring the 
grid in which the media recording was made, and intra-grid localization, which 
entails pinpointing the location-of-recording of the media signal within a grid. 


10.5.3.1 Inter-Grid Localization 


This section describes a system that seeks to identify the grid in which an ENF- 
containing recording was made, without the use of concurrent power references Hajj- 
Ahmad et al. (2013), Hajj-Ahmad et al. (2015). Such a system can be very important 
for multimedia forensics and security. It can pave the way to identify the origins of 
videos such as those of terrorist attacks, ransom demands, and child exploitation. 
It can also reduce the computational complexity and facilitate other ENF-based 
forensics applications, such as the time-of-recording authentication application of 
Sect. 10.5.1. For instance, if a forensic analyst is given a media recording of unknown 
time and location information, inferring the grid-of-origin of the recording would 
help in narrowing down the set of power reference recordings that need to be used 
to compare the media ENF pattern. 

Upon examining ENF signals from different power grids, it can be noticed that 
there are differences between them in the nature and the manner of the ENF vari- 
ations (Hajj-Ahmad et al. 2015). These differences are generally attributed to the 
control mechanisms used to regulate the frequency around the nominal value and the 
size of the power grid. Generally speaking, the larger the power grid is, the smaller 
the frequency variations are. Figure 10.14 shows the 1-min average frequency during 
a 48-hour period from five different locations in five different grids. As can be seen in 
this figure, Spain, which is part of the large continental European grid, shows a small 
range of frequency values as well as fast variations. The smallest grids of Singapore 
and Great Britain show relatively large variations in the frequency values (Bollen 
and Gu 2006). 

Given an ENF-containing media recording, a forensic analyst can process the 
captured ENF signal to extract its statistical features to facilitate the identification 
of the grid in which the media recording was made. A machine learning system can 
be built that learns the characteristics of ENF signals from different grids and uses 
it to classify ENF signals in terms of their grids-of-origin. Hajj-Ahmad et al. (2015) 
have developed such a machine learning system. In that implementation, ENF signal 
segments are extracted from recordings 8 min long each. Statistical, wavelet-based, 
and linear-predictive-related features are extracted from these segments and used to 
train an SVM multiclass system. 

The authors make a distinction between “clean” ENF data extracted from power 
recordings and “noisy” ENF data extracted from audio recordings, and various com- 
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Fig. 10.14 Frequency variations measured in Sweden (top left), in Spain (top center), on the Chinese 
east coast (top right), in Singapore (bottom left), and in Great Britain (bottom right) (Bollen and 
Gu 2006) 


binations of these two sets of data are used in different training scenarios. Overall, the 
authors were able to achieve an average accuracy of 88.4% on identifying ENF signals 
extracted from power recordings from eleven candidate power grids, and an average 
accuracy of 84.3% on identifying ENF signals extracted from audio recordings from 
eight candidate grids. In addition, the authors explored using multi-conditional sys- 
tems that can adapt to cases where the noise conditions of the training and testing 
data are different. This approach was able to improve the identification accuracy of 
noisy ENF signals extracted from audio recordings by 28% when the training dataset 
is limited to clean ENF signals extracted from power recordings. 

This ENF-based application was the focus of the 2016 Signal Processing Cup 
(SP Cup), an international undergraduate competition overseen by the IEFE Signal 
Processing Society. The competition engaged participants from nearly 30 countries. 
334 students formed 52 teams that registered for the competition, among them, 
more than 200 students in 33 teams turned in the required submissions by the open 
competition deadline in January 2016. The top three teams from the open competition 
attended the final stage of the competition at ICASSP 2016 in Shanghai, China, to 
present their final work (Wu et al. 2016). 

Most student participants from the SP Cup have uploaded their final reports to 
SigPort (Signal Processing 2021). A number of them have also further developed their 
works and published them externally. A common improvement on the SVM approach 
originally proposed in Hajj-Ahmad et al. (2015) was to make the classification system 
a multi-stage one, with earlier stages distinguishing between 50 Hz grid versus 60 Hz 
grid, or an audio signal versus a power signal. An example of such a system is 
proposed in Suresha et al. (2017). 
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In Šarié et al. (2016), a team from Serbia examined different machine learning 
algorithms that can be used to achieve inter-grid localization, including K -nearest 
neighbors, random forests, SVM, linear perceptron, and neural networks. In their 
setup, the classifier that achieved the highest accuracy overall was random forest. 
They also explored adding additional features to those of Hajj-Ahmad et al. (2015), 
related to the extrema and rising edges in the ENF signal, which showed performance 
improvements between 3% and 19%. 

Aside from SP-Cup-related works, Jeon et al. (2018) demonstrated how their ENF 
map proposed in Jeon et al. (2018) can be utilized for inter-grid location identification 
as part of their LISTEN framework. Their experimental setup included identifying 
grid-of-origin of audio streams extracted from Skype and Torfan from seven power 
grids, achieving classification accuracies in the 85—90% range for audio segments of 
length 10—40 mins. Their proposed system also tackled intra-grid localization, which 
we will discuss next. 


10.5.3.2 Intra-Grid Localization 


Though the ENF variations at different points in the same grid are very similar, 
research has shown that there can be discernible differences between these variations. 
The differences can be due to the local load characteristics of a given city and the 
time needed to propagate a response to demand and supply in the load to other 
parts of the grid (Bollen and Gu 2006). System disturbances, such as short circuits, 
line switching, and generator disconnections, can be contributing causes to these 
different ENF values (Elmesalawy and Eissa 2014). When a significant change in 
the load occurs somewhere in the grid, it has a localized effect on the ENF in the 
given area. This change in the ENF then propagates across the grid at, typically, a 
speed of approximately 500 ms (Tsai et al. 2007). 

In Hajj-Ahmad et al. (2012), the authors conjecture that small and large changes in 
the load may cause location-specific signatures in local ENF patterns. Following this 
conjecture, such differences may be exploited to pinpoint the location-of-recording 
of a media signal within a grid. Due to the finite propagation speed of frequency 
disturbances across the grid, the ENF signal is anticipated to be more similar for 
locations close to each other as compared to locations that are farther apart. As an 
experiment, concurrent recordings of power signals were done in three cities in the 
Eastern North American grid shown in Fig. 10.15a. 

Examining the ENF signals extracted from the different power recordings, the 
authors found that they have a similar general trend but possible microscopic dif- 
ferences. To better capture these differences, the extracted ENF signals are passed 
through a highpass filter, and the correlation coefficient between the highpassed ENF 
signals is used as a metric to examine the location similarity between recordings. 
Figure 10.15b shows the correlation coefficients between various 500s ENF seg- 
ments from different locations at the same time. We can see that the closer the two 
city-of-origins are, the more similar their ENF signals are, and the farther they are 
apart, the less similar their ENF signals are. From here, one can start to think about 
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using the ENF signal as a location stamp and exploiting the relation between pairwise 
ENF similarity and distance between locations of recording. A highpass-filtered ENF 
query can be compared to highpass-filtered ENF signals from known city anchors in 
a grid as a means toward pinpointing the location-of-recording of an ENF query. 

The authors further examined the use of the ENF signal as an intra-grid location 
stamp in Garg et al. (2013), where they proposed a half-plane intersection method to 
estimate the location of ENF-containing recordings and found that the localization 
accuracy can be improved with the increase in the number of locations used as 
anchor nodes. This study conducted experiments on power ENF signals. With audio 
and video signals, the situation is more challenging because of the noisy nature of 
the embedded ENF traces. The location-specific signatures are best captured using 
instantaneous frequencies estimated at 1s temporal resolution, and reliable ENF 
signal extraction at such a high temporal resolution is an ongoing research problem. 
Nevertheless, there is a potential of being able to use ENF signals as a location stamp. 

Jeon et al. (2018) have included intra-grid localization as part of the capabilities of 
their LISTEN framework for inferring the location-of-recording of a target recording, 
which relies on an ENF map built using ENF traces collected from online streaming 
sources. After narrowing down the power grid to which a recording belongs through 
inter-grid localization, their intra-grid localization approach relies on calculating 
the Euclidean distance between a time-series sequence of interpolated signals in the 
chosen power grid and the target recording’s ENF signal. The approach then narrows 
down the location of recording to a specific region within the grid. 

In Chai et al. (2016); Yao et al. (2017); Cui et al. (2018), the authors rely on the 
FNET/GridEye system, described in Sect. 10.2.1, to achieve intra-grid localization. 
In these works, the authors work on ENF-based intra-grid localization for ENF 
signals extracted from clean power signals without using concurrent ENF power 
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references. Such intra-grid localization is made possible by exploiting geolocation- 
specific temporal variations induced by electromechanical propagation, nonlinear 
load, and recurrent local disturbance (Yao et al. 2017). 

In Cui et al. (2018), the authors apply features similar to those proposed in intra- 
grid localization approaches and rely on a random forest classifier for training. In 
Yao et al. (2017), the authors apply wavelet analysis to ENF signals and feed detail 
signals into a neural network for training. The results show that the system works 
better at larger geographical scales and when the training and testing ENF-containing 
recordings were recorded closer together in time. 


10.5.4 ENF-Based Camera Forensics 


An additional ENF-based application that has been proposed uses the ENF traces 
captured in a video recording to characterize the camera that was used to produce 
the video (Hajj-Ahmad et al. 2016). This is done within a nonintrusive framework 
that only analyzes the video at hand to extract a characterizing inner parameter of the 
camera used. The focus was on CMOS cameras equipped with rolling shutters, and 
the parameter estimated was the read-out time T,., which is the time it takes for the 
camera to acquire the rows of a single frame. T. is typically not listed in a camera’s 
user manual and is usually less than the frame period equal to the reciprocal of a 
video’s frame rate. 

This work was inspired by prior work on flicker-based video forensics that 
addresses issues in the entertainment industry pertaining to movie piracy-related 
investigations (Baudry et al. 2014; Hajj-Ahmad 2015). The focus of that work was 
on pirated videos that are produced by camcording video content displayed on an 
LCD screen. Such pirated videos commonly exhibit an artifact called the flicker sig- 
nal, which results from the interplay between the backlight of the LCD screen and 
the recording mechanism of the video camera. In Hajj-Ahmad (2015), this flicker 
signal is exploited to characterize the LCD screen and camera producing the pirated 
video by estimating the frequency of the screen’s backlight signal and the camera’s 
read-out time To value. 

Both the flicker signal and the ENF signal are signatures that can be intrinsically 
embedded in a video due to the camera’s recording mechanism and the presence 
of a signal in the recording environment. For the flicker signal, the environmental 
signal is the backight signal of the LCD screen, while for the ENF signal, it is the 
electric lighting signal in the recording environment. In Hajj-Ahmad et al. (2016), the 
authors leverage the similarities between the two signals to adapt the flicker-based 
approach to an ENF-based approach targeted at characterizing the camera producing 
the ENF-containing video. The authors carried out an experiment involving ENF- 
containing videos produced using five different cameras, and the results showed that 
the proposed approach achieves high accuracy in estimating the discriminating T,, 
parameter with a relative estimation error within 1.5%. 
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Vatansever et al. (2019) further extend the work on the application by proposing an 
approach that can operate on videos that the approach in Hajj-Ahmad et al. (2016) 
cannot address, which are cases where the main ENF component is aliased 0 Hz 
when the nominal ENF is a multiple of the video camera’s frame rate, e.g., a video 
camera with 25 fps capturing 100 Hz ENF. The approach examines the two strongest 
frequency components at which the ENF component appears, following a model 
proposed in the same paper, and deduces the read-out time T,o based on the ratio of 
strength between these two components. 


10.6 Anti-Forensics and Countermeasures 


ENF signal-based forensic investigations such as time—location authentication, 
integrity authentication, and recording localization discussed in Sect. 10.5 rely on 
the assumption that the ENF signal buried in a hosting signal is not maliciously 
altered. However, there may exist adversaries performing anti-forensic operations to 
mislead forensic investigations. In this section, we examine the interplay between 
forensic analysts and adversaries to better understand the strengths and limitations 
of ENF-based forensics. 


10.6.1 Anti-Forensics and Detection of Anti-Forensics 


Anti-forensic operations by adversaries aim at altering the ENF traces to invalidate 
or mislead the ENF analysis. One intuitive approach is to superimpose an alternative 
ENF signal without removing the native ENF signal. This approach can confuse 
the forensic analyst when the two ENF traces are overlapping and have comparable 
strengths, under which it is impossible to separate the native ENF signal without using 
side information (Su et al. 2013; Zhu et al. 2020). A more sophisticated approach is 
to remove the ENF traces either in the frequency domain by using a bandstop filter 
around the nominal frequency (Chuang et al. 2012; Chuang 2013) or in the time 
domain by subtracting a time signal synthesized via frequency modulation based on 
some accurate estimates of the ENF signal such as those from the power mains (Su 
et al. 2013). The most malicious approach is to mislead the forensic analyst by 
replacing the native ENF signal with some alternative ENF signal (Chuang et al. 
2012; Chuang 2013). This can be done by first removing the native signal and then 
adding an alternative ENF signal. 

Forensic analysts are motivated to devise ways to detect the use of anti-forensic 
operations (Chuang et al. 2012; Chuang 2013), so that the forged signal can be 
rejected or be recovered to allow forensic analysis. In case an anti-forensic operation 
for ENF is carried out merely around the fundamental frequency, the consistency of 
the ENF components from multiple harmonic frequency bands can be used to detect 
such operation. The abrupt discontinuity in a spectrogram may be another indicator 
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for an anti-forensic operation. In Su et al. (2013), the authors formulated a composite 
hypothesis testing problem to detect the operation of superimposing an alternative 
ENF signal. 

Adversaries may respond to forensic analysts by improving their anti-forensic 
operations (Chuang et al. 2012; Chuang 2013). Instead of replacing the ENF com- 
ponent at the fundamental frequency, adversaries can attack all harmonic bands as 
well. The abrupt discontinuity check for spectrogram can be addressed by envelope 
adjustment via the Hilbert transform. Such dynamic interplay continues by evolving 
their actions in response to each other’s actions (Chuang et al. 2012; Chuang 2013), 
and actions tend to become more complex as the interplay goes on. 


10.6.2 Game-Theoretic Analysis on ENF-Based Forensics 


Chuang et al. (2012), Chuang (2013) quantitatively analyzed the dynamic interplay 
between forensic analysts and adversaries using the game-theoretic framework. A 
representative but simplified scenario that involves different actions was quantita- 
tively evaluated. The optimal strategies in terms of Nash equilibrium, namely the 
status in which no player can increase his/her own benefit (or formally, the utility) 
via unilateral strategy changes, were derived. The optimal strategies and the resulting 
forensic and anti-forensic performances at the equilibrium can lead to a comprehen- 
sive understanding of the capability of ENF-based forensics in a specific scenario. 


10.7 Applications Beyond Forensics and Security 


The ENF signal as an intrinsic signature of multimedia recordings gives rise to 
applications beyond forensics and security. For example, an ENF signal’s unique 
fluctuations can be used as a time—location signature to allow the synchronization of 
multiple pieces of media signals that overlap in time. This allows the synchronized 
signal pieces to jointly reveal more information than the individual pieces (Su et al. 
2014c,b; Douglas et al. 2014). In another example, the slowly varying trend of the 
ENF signal can serve as an anchor for the detection of abnormally recorded segments 
within a recording (Chang and Huang 2010; Feng-Cheng Chang and Hsiang-Cheh 
Huang 2011) and can thus allow for the restoration of tape recordings suffering from 
irregular motor speed (Su 2014). 


10.7.1 Multimedia Synchronization 


Conventional multimedia synchronization approaches rely on passive/active calibra- 
tion of the timestamps or identifying common contextual information from two or 
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more recordings. Voice and music are the contextual information that can be used 
for audio synchronization. Overlapping visual scenes, even recorded from different 
viewing angles, can be exploited for video synchronization. ENF signals naturally 
embedded in the audio and visual tracks of recordings can complement conventional 
synchronization approaches as they would not rely on common contextual informa- 
tion. 

ENF-based synchronization can be categorized into two major scenarios, one 
where multiple recordings for synchronization overlap in time, and one where they 
do not. If multiple recordings from the same power grid are recorded with time 
overlap, then they can be readily synchronized by the ENF without using other 
information. In Su et al. (2014c), Douglas et al. (2014), Su et al. (2014b), the authors 
use the ENF signals extracted from audio tracks for synchronization. For videos 
containing both the visual and audio tracks, it is possible to use either modality as 
the source of ENF traces for synchronization. Su et al. (2014b) demonstrate using 
the ENF signals from the visual track for synchronization. Using the visual track 
for multimedia synchronization is important for such scenarios as video surveillance 
cases where an audio track may not be available. 

In Golokolenko and Schuller (2017), the authors show that the idea can be 
extended to synchronizing the audio streams from wireless/low-cost USB sound 
cards in a microphone array. Sound cards have their own respective sample buffers, 
which fillup at different rates, thus leading to synchronization mismatches of streams 
by different sound cards. Conventional approaches to address the synchronization 
require specialized hardware, which is not a requirement for a synchronization 
approach based on using the captured ENF traces. 

If recordings were not captured with time overlap, then reference ENF databases 
are needed to help determine the absolute recording time of each piece. This falls 
back to the time-of-recording authentication scenario discussed in Sect. 10.5.1— 
Each recording will need to determine its timestamp by matching against one or 
multiple reference databases, depending on whether the source location of the record- 
ing is known. 


10.7.2 Time-Stamping Historical Recordings 


ENF traces can help archivists by providing a time-synchronized exhibit of multime- 
dia files. Many twentieth century recordings are important cultural heritage records, 
but some may lack necessary metadata, such as the date and the exact time of record- 
ing. Suet al. (2013), Suet al. (2014c), Douglas et al. (2014) found ENF traces in the 
1960s phone conversation recordings of President Kennedy in the White House, and 
in the multi-channel recordings of the 1970 NASA Apollo 13 mission. In Su et al. 
(2014c), the ENF was used to align two recordings of around 4 hours in length from 
the Apollo 13 mission, and the result was confirmed by the audio contents of the two 
recordings. 
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A set of time-synchronized exhibits of multimedia files with reliable metadata 
of recording time and high-SNR ENF traces can be used to reconstruct a reference 
ENF database for various purposes (Douglas et al. 2014). Such a reconstructed ENF 
database from a set of recordings can be valuable for applying ENFanalysisto sensing 
data collected in the past before power signals recorded for reference purposes is 
available. 


10.7.3 Audio Restoration 


Audio signal digitized from an analog tape has its content frequency “drifted” off 
the original value due to the inconsistent rolling speed between the recorder and 
the digitizer. The ENF signal, although varying along time, has a general trend 
that remains around the nominal frequency. When ENF traces are embedded in 
a tape recording, this consistent trend can serve as the anchor for correcting the 
abnormal speed of the recording (Su 2014). Figure 10.16a shows the drift of the 
general trend of the ENF followed by an abrupt jump in a recording from the NASA 
Apollo 11 Mission. The abnormal speed can be corrected by temporally stretching 
or compressing the frame segments of an audio signal using techniques such as 
multi-rate conversion and interpolation (Su 2014). The spectrogram of the resulting 
corrected signal is shown in Fig. 10.16b in which the trend of the ENF is constant 
without an abrupt jump. 


10.8 Conclusions and Outlook 


This chapter has provided an overview of the ENF signal, an environment signature 
that can be captured by multimedia recordings made in locations where there is elec- 
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Fig. 10.16 The spectrogram of an Apollo mission control recording a before and b after speed 
correction on the digitized signal (Su 2014) 
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trical activity. We first examined the embedding mechanism of ENF traces, followed 
by how ENF signals, can be extracted from audio and video recordings. We noted 
that in order to reliably extract ENF signals from visual recordings, it is often helpful 
to exploit the rolling-shutter mechanism commonly used by cameras with CMOS 
image sensor arrays. We then systematically reviewed how the ENF signature can 
be used for media recordings” time and/or location authentication, integrity authen- 
tication, and localization. We also touched on anti-forensics in which adversaries try 
to mislead forensic analysts and discussed countermeasures. Applications beyond 
forensics, including multimedia synchronization and audio restoration, were also 
discussed. 

Looking into the future, we notice many new problems have naturally arisen as 
technology continues to evolve. For example, given the popularity of short video clips 
that last for less than 30s, ENF analysis tools need to accommodate short recordings. 
The technical challenge lies in how to efficiently make use of the ENF information 
than traditional ENF analysis tools that focus on recordings of minutes or even hours 
long. 

Another technical direction is to push forward the limit of ENF extraction algo- 
rithms from working on videos to images. Is it possible to extract enough ENF 
information from a sequence of fewer than ten images captured over a few sec- 
onds/minutes to infer the time and/or location? For the case of ENF extraction from 
a single rolling-shutter image, is it possible to know more beyond whether the image 
is captured at a 50 or 60 Hz location? 

Exploring new use cases of ENF analysis based on its intrinsic ENF embedding 
mechanism is also a natural extension. For example, deepfake video detection may 
be a potential avenue for ENF analysis. ENF can be naturally embedded through 
the flickering of indoor lighting into the facial region and the background of a video 
recording. A consistency test may be designed to complement detection algorithms 
based on other visual/statistical cues. 
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Chapter 11 ®) 
Data-Driven Digital Integrity Verification | ss 


Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva 


Images and videos are by now a dominant part of the information flowing on the 
Internet and the preferred communication means for younger generations. Besides 
providing information, they elicit emotional responses, much stronger than text does. 
It is probably for these reasons that the advent of Al-powered deepfakes, realistic 
and relatively easy to generate, has raised great concern among governments and 
ordinary people alike. Yet, the art of image and video manipulation is much older 
than that. Armed with conventional media editing tools, a skilled attacker can create 
highly realistic fake images and videos, so-called “cheapfakes”. However, despite 
the nickname, there is nothing cheap in cheapfakes. They require significant domain 
knowledge to be crafted and their detection, can be much more challenging than the 
detection of Al-generated material. In this chapter, we focus on the detection of cheap- 
fakes, that is, image and video manipulations carried out with conventional means. 
However, we skip conventional detection methods, already thoroughly described in 
many reviews, and focus on the data-driven deep learning-based methods proposed in 
recent years. We look at manipulation detection methods under various perspectives. 
First, we analyze in detail the forensics traces they rely on, and then the main architec- 
tural solutions proposed for detection and localization, together with the associated 
training strategies. Finally, major challenges and future directions are highlighted. 


11.1 Introduction 


In the last few years, multimedia forensics has been drawing ever-increasing atten- 
tion in the scientific community and beyond. Indeed, with the editing software tools 
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available nowadays, modifying images or videos has become extremely easy. In the 
wrong hands, and used for the wrong purposes, this capability may represent major 
threats for both individuals and the society as a whole. In this chapter, we focus on 
conventional manipulations, also known as cheapfakes, in contrast to deepfakes that 
are based on the use of deep learning tools for their generation (Paris and Donovan 
2019). Generating a cheapfake does not require artificial-intelligence-based technol- 
ogy, however a skilled attacker can carry out very realistic fakes, which can have 
disruptive consequences. In Fig. 11.1, we show some examples of very common 
manipulations: inserting an object copied from a different image (splicing), replicat- 
ing an object coming from the same image (copy-move), or removing an object by 
extending the background (inpainting). Usually, to better fit the object in the scene, 
some suitable post-processing operations are also applied, like resizing, rotation, 
boundary smoothing, or color adjustment. 

A large number of methods have been proposed for forgery detection and local- 
ization, both for images and videos (Farid 2016; Korus 2017; Verdoliva 2020). Here, 
we focus on methods that verify the digital (as opposed to physical or semantic) 
integrity of the media asset, namely, discover the occurrence of a manipulation by 
detecting the pixel-level inconsistencies it caused. It is worth emphasizing that even 
well-crafted manipulations, which do not leave visible artifacts on the image, always 
modify its statistics, leaving traces that can be exploited by pixel-level analysis tools. 
In fact, the image formation process inside a camera comprises a certain number of 
operations, both hardware and software, specific of each camera, which leave dis- 
tinctive marks on each acquired image (Fig. 11.2). For example, camera models use 
widely different demosaicing algorithms, as well as different quantization tables 


Splicing Copy-Move Inpainting 


Pristine Image 


Forged Image 


Ground Truth 
= 


Fig. 11.1 Examples of image manipulations carried out using conventional media editing tools. 
From left to right: splicing (alien material inserted in the image), copy-move (an object has been 
cloned), inpainting (objects removed using background patches) 
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Fig. 11.2  In-camera processing pipeline. An image is captured using a complex acquisition sys- 
tem. The optical filter reduces undesired light components and the lenses focus the light on the 
sensor, where the red-green-blue (RGB) components are extracted by means of a color filter array 
(CFA). Then, a sequence of internal processing steps follow, including lens distortion correction, 
enhancement, demosaicing, and finally compression 


for JPEG compression. Their traces allow one to identify the camera model that 
acquired a given image. So, when parts of different images are spliced together, a 
spatial anomaly in these traces can be detected, revealing the manipulation. Like- 
wise, the post-camera editing process may introduce some specific artifacts, as well 
as disrupt camera-specific patterns. Therefore, by emphasizing possible deviations 
with respect to the expected behavior, one can establish with good confidence if the 
digital integrity has been violated. 

In the multimedia forensics literature, there has been intense work on detecting 
any possible type of manipulation, global or local, irrespective of its purpose. Sev- 
eral methods detect a large array of image manipulations, like resampling, median 
filtering, and contrast enhancement. However, these operations can also be carried 
out only to improve the image appearance or for other legitimate purposes, with no 
malicious intent. For this reason, in this chapter, we will focus on detecting local 
(semantically) relevant manipulations, like copy-moves, inpainting, and composi- 
tions, that can modify the meaning of the image. 

Early digital integrity method followed a model-based approach, relying on some 
mathematical or statistical models of data and attacks. More recent methods, instead, 
are for the most part data-driven, based on deep learning: they exploit the availability 
of large amounts of data to learn how to detect specific or generic forensic clues. 
These can be further classified based on whether they give as output an image-level 
detection score that indicates the probability that the image has been manipulated or 
not, or a pixel-level localization mask that provides a score for each pixel. In addition, 
we will distinguish methods based on the training strategy they apply: 


e two-class learning, where training is carried out on datasets of labeled pristine and 
manipulated data; 

e one-class learning, where only real data are used during training and manipulations 
are regarded as anomalies. 


In the rest of the chapter, we first analyze in detail the forensics clues exploited 
in the various approaches, then review the main architectural solutions proposed 
for detection and localization, together with their training strategies and the most 
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popular datasets proposed in this field. Finally, we draw conclusions and highlight 
major challenges and future directions. 


11.2 Forensics Clues 


Learning-based detection methods are typically designed on the basis of the specific 
forensic artifacts they look for. For example, some proposed convolutional neural 
networks (CNNs) base their decision on low-level features (e.g., camera-based), 
while others focus on higher level features (e.g., edge boundaries in a composition), 
and still others on features of both types. Therefore, this section will review the most 
interesting clues that can be used for digital integrity verification. 


11.2.1 Camera-Based Artifacts 


One powerful approach is to exploit the many artifacts that are introduced during 
the image acquisition process (Fig. 11.2). Such artifacts are specific of the individual 
camera (the device) or the camera model, and therefore represent some form of 
signature superimposed on the image, which can be exploited for forensic purposes. 
As an example, if part of the image is replaced with content taken from another 
source, this added material will lack the original camera artifacts, allowing detection 
of the attack. Some of the most relevant artifacts are related to the sensor, the lens, 
the color filter array, the demosaicing algorithm, and the JPEG quantization tables. 
In all cases, with the aim of analyzing weak artifacts, the image content (the scene) 
represents an undesired strong disturbance. Therefore, it is customary to work on the 
so-called noise residuals, obtained by suppressing the high-level content. To this end, 
one can estimate the “true” image by means of a denoising algorithm, and subtract it 
from the original image, or use some high-pass filters in the spatial or in the transform 
domain (Lyu and Farid 2005; He et al. 2012). 

A powerful camera-related artifact is the photo-response non-uniformity noise 
(PRNU), successfully used not only for forgery detection but also for source identi- 
fication (Chen et al. 2008). The PRNU pattern is due to unavoidable imperfections 
in the sensor manufacturing process. It is unique for each individual camera, stable 
in time, and present in all acquired images, therefore it can be considered as a device 
fingerprint. Using such a fingerprint, manipulations of any type can be detected. In 
fact, whenever the image is modified, the PRNU pattern is corrupted or even fully 
removed, a strong clue of a possible manipulation (Lukas et al. 2006). The main 
drawbacks of PRNU-based methods are 


(i) the need of images taken from the same camera under test to obtain a good 
estimate of the reference pattern and 
(ii) the need to have a PRNU pattern spatially aligned with the image under test. 
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Fig. 11.3 Training a noiseprint extraction network. Pairs of patches feed the two branches of the 
Siamese network that extract the corresponding noiseprint patches. The output distance must be 
minimized for patches coming from the same camera and spatial location, maximized for the others 


In Cozzolino and Verdoliva (2020), a learning-based strategy is proposed to 
extract a stronger camera fingerprint, the image noiseprint. Like the PRNU pattern, it 
behaves as a fingerprint embedded in all acquired images. Contrary to it, it depends 
on all camera-model artifacts which contribute to the digital history of an image. 
Noiseprints are extracted by means of a CNN, trained to capture, and emphasize all 
camera-model artifacts, while suppressing the high-level semantic content. To this 
end, the CNN is trained in a Siamese configuration, with two replicas of the network, 
identical for architecture and weights, on two parallel branches. The network is fed 
with pairs of patches drawn from a large number of pristine images of various camera 
models. When the patches on the two branches are aligned (same camera model and 
same spatial position, see Fig. 11.3) they can be expected to contain much the same 
artifacts. Therefore, the output of each branch can be used as reference for the input 
of the other one, by-passing the need for clean examples. Eventually, the network 
learns to extract the desired noiseprint, and can be used with no further supervision 
on images captured by any camera model, both inside and outside the training set. 

Of course, the extracted noiseprints will always contain traces of the imperfectly 
rejected high-level image, acting as noise for the forensic purposes. However, by 
exploiting all sorts of camera-related artifacts, noiseprints enable more powerful and 
reliable forgery detection procedures than PRNU patterns. On the down side, the 
noiseprint does not allow one to distinguish between two devices of the same camera 
model. Therefore, it should not be regarded as a substitute of the PRNU, but rather 
as a complementary tool. 

In Fig. 11.4, we show some examples of noiseprints extracted from different fake 
images, with the corresponding heat maps obtained by feature clustering. In the first 
case, the noiseprint clearly shows the 8 x 8 grid of the JPEG format. Sometimes, 
traces left by image manipulations on the extracted noiseprint are so strong to allow 
easy localization even by direct inspection. The approach can be also easily extended 
to videos (Cozzolino et al. 2019), but larger datasets are necessary to make up for the 
reduced quality of the sources, due to heavy compression. Interestingly, noiseprints 
can be also used in a supervised setting (Cozzolino and Verdoliva 2018). Given a 
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Fig. 11.4 Examples of manipulated images (top) with their corresponding noiseprints (middle) 
and heat maps obtained by a proper clustering (bottom) 
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Fig. 11.5 Noiseprint in supervised modality. The localization procedure is the classic pipeline used 
with PRNU-based methods: the noiseprint is extracted from a set of pristine images taken by the 
same camera of the image under analysis, their average represents a clean reference, namely, a 
reliable estimate of the camera noiseprint that can be used for image forgery localization 


suitable training set of images, a reliable estimate of the noiseprint is built, where 
high-level scene leakages are mostly removed, and used in a PRNU-like fashion to 
discover anomalies in the image under analysis (see Fig. 11.5). 

As we mentioned before, many different artifacts arise from the in-camera process- 
ing pipeline, related to sensor, lens, and especially the internal processing algorithms. 
Of course, one can also focus on just one of these artifacts to develop a manipulation 
detector. Indeed, this has been the dominant approach in the past, with many model- 
based methods proposed to exploit very specific features, leveraging a deep domain 
knowledge. In particular, there has been intense research on the CFA-demosaicing 
combination. Most digital cameras use a periodic color filter array, so that each indi- 
vidual sensor element records light only in a certain range of wavelengths (i.e., red, 
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Fig. 11.6 Histograms of the DCT coefficients clearly show the statistical differences between 
images subject to a single or double JPEG compression 


green, blue). The missing color information is then interpolated from surrounding 
pixels in the demosaicing process, which introduces a subtle periodic correlation 
pattern in all acquired images. Both CFA configuration and demosaicing algorithms 
are specific of each camera model, leading to different correlation patterns which 
can be exploited for forensic means. For example, when a region is spliced in a photo 
taken by another camera model, its periodic pattern will appear anomalous. Most of 
the methods proposed in this context are model-based (Popescu and Farid 2005; Cao 
and Kot 2009; Ferrara et al. 2012), however recently a convolutional neural network 
specifically tailored to the detection of mosaic artifacts has been proposed (Bammey 
et al. 2020). The network is trained only on authentic images and includes 3 x 3 
convolutions with a dilation of 2, so as to examine pixels that all belong to the same 
color channel and detect possible inconsistencies. In fact, a change of the mosaic can 
lead to forged blocks being detected at incorrect positions modulo (2, 2). 


11.2.2 JPEG Artifacts 


Many algorithms in image forensics exploit traces left by the compression process. 
Some methods look for anomalies in the statistical distribution of original DCT 
samples, assumed to comply with the Benford law (Fu et al. 2007). However, the 
most popular approach is to detect double compression traces (Bianchi and Piva 
2012). In fact, when a JPEG-compressed image undergoes a local manipulation 
and then is compressed again, double compression (or double quantization, DQ) 
artifacts appear all over the image (with very high probability) in the forged area. 
This phenomenon is exploited also in the well-known Error Level Analysis (ELA), 
widespread among practitioners for its simplicity. 

The effects of double compression are especially visible in the distributions of the 
DCT coefficients which, after the second compression, show characteristic periodic 
peaks and valleys (Fig. 11.6). Based on this observation, Wang and Zhang (2016) 
performs splicing detection by means of a CNN trained with histograms of DCT 
coefficients. The histograms are computed either from single-compressed patches 
(forged areas) or double-compressed ones (pristine areas), with quality factors in 


288 D. Cozzolino et al. 


the range 60-95. To reduce the dimensionality of the feature vector, only a small 
interval near the peak of each histogram is chosen to represent the whole histogram. 
Following this seminal paper, also Amerini et al. (2017) and Park et al. (2018) 
use histograms of DCT coefficients as input, while (Kwon et al. 2021) uses one- 
hot coded DCT coefficients as already proposed in Yousfi and Fridrich (2020) for 
steganalysis. Also the quantization tables can bring useful information, and are used 
as additional input in Park et al. (2018); Kwon et al. (2021). Leveraging domain- 
specific prior knowledge, the approach proposed in Barni et al. (2017) works on noise 
residuals rather than on image pixels, and uses the first layers of the CNN to extract 
histogram-related features. In Park et al. (2018), instead, the quantization tables from 
the JPEG header are used as input together with the histogram features. It is worth 
observing that methods based on deep learning provide good results also in cases 
where conventional methods largely fail, such as when test images are compressed 
with a quality factor (QF) never seen in training or when the second QF is larger than 
the first one. Double compression detectors have been also proposed for H.264 video 
analysis, with a two-stream neural network that analyzes separately intra-coded and 
predictive frames (Nam et al. 2019). 


11.2.3 Editing Artifacts 


The editing process often generates a trail of precious traces, besides artifacts related 
to re-compression. Indeed, when a new object is inserted in an image, it typically 
requires several post-processing steps to fit the new context smoothly. These include 
geometric transformations, like rotation and scaling, contrast adjustment, and blur- 
ring, to smooth the object-background boundaries. Both rotation and resampling, 
for example, require some forms of interpolation, which in turn generates periodic 
artifacts. Therefore, some methods (Kirchner 2008) look for traces of resampling 
as a proxy for possible forgeries. In this section, we describe the most used and 
discriminative editing-based traces exploited in forensics algorithms. 


Copy-moves 


A very common manipulation consists in replicating or hiding objects. Of course, the 
presence of identical regions is a strong hint of forgery, but clones are often modified 
to disguise traces, and near-identical natural objects also exist, which complicates 
the forensic analysis. Studies on copy-move detection date back to 2003, with the 
seminal work of Fridrich (2003). Since then, a large literature has grown on this 
topic proposing solutions that allow for copy-move detection even in the presence of 
rotation, resizing, and other geometric distortions (Christlein et al. 2012). Methods 
based on keypoints are very efficient, while block-based dense methods (Cozzolino 
et al. 2015) are more accurate and deal also with removals. 
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Fig. 11.7 Examples of additive copy-moves. From left to right: manipulated images and corre- 
sponding ground truths, binary localization maps obtained using a dense approach (Cozzolino et al. 
2015) and those obtained by means of a deep-learning-based one (Chen et al. 2020). The latter 
method allows telling apart the source and copied areas 


A first solution that relies on deep learning has been proposed in Wu et al. (2018), 
where an end-to-end CNN-based solution is able to jointly optimize the three main 
steps of a classic copy-move solution, that is, feature extraction, matching, and post- 
processing. To improve performance across different scenarios, multiscale feature 
analysis and hierarchical feature mapping are employed in Zhong and Pun (2019). 
Overall, deep-learning-based methods appear to outperform conventional methods 
on low-resolution images (i.e., small-size images) and can be designed to distinguish 
the copy from the original region. In fact, typically, copy-move detection methods 
generate a map where both the original object and its clone are highlighted, but do not 
establish which is which. The problem of source-target disambiguation is considered 
in Wu et al. (2018); Barni et al. (2019); Chen et al. (2020). The main idea originally 
proposed in Wu et al. (2018) is to use a CNN for manipulation detection and another 
one for similarity computation. Then a suitable fusion of these two outputs helps to 
distinguish the source region from its copies (Fig. 11.7). 


Inpainting 


Inpainting is a very effective image manipulation approach that can be used to 
hide objects/people. The main classic approaches for their detection rely on meth- 
ods inspired by copy-move detection, given that in the widespread exemplar-based 
inpainting, multiple small regions are copied from all over the image and combined 
together to cover the object. Patch-based inpainting is addressed in Zhu et al. (2018) 
with an encoder/decoder architecture which provides a localization map as output. 
In Li and Huang (2019), instead, a fully convolutional network is used to localize the 
inpainted area in the residual domain. The CNN does not look any specific editing 
artifact caused by inpainting, but only trained on a large number of examples of 
real and manipulated images, making the detection very effective for the analyzed 
methods. 
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Splicings 


Other approaches focus on anomalies which appear at the boundaries of objects 
when a composition is performed, due to the inconsistencies between regions drawn 
from different sources. For example, Salloum et al. (2018) uses a multi-task fully 
convolutional network comprising a specific branch for detecting boundaries between 
inserted regions and background and another one for analyzing the surface of the 
manipulation. Both in this work and in Chen et al. (2021), training is carried out 
also using the edge map extracted from the ground truth. In other papers (Liu et al. 
2018; Shi et al. 2018; Zhang et al. 2018; Phan-Xuan et al. 2019), instead, training 
is performed on patches that are either authentic or composed of pixels belonging 
to the two classes. Therefore, the CNN is forced to look for the boundaries of the 
manipulation. 


Face Warping 


In Wang et al. (2019), a CNN-based solution is proposed to detect artifacts introduced 
by a specific Photoshop tool, Face-Aware Liquify, which performs image warping 
of human faces. The approach is very successful for this task because the network 
is trained on manipulated images automatically generated by the very same tool. To 
deal with more challenging situations and increase robustness, data augmentation is 
applied by including resizing, JPEG compression, and various types of histogram 
editing. 


11.3 Localization Versus Detection 


Integrity verification can be carried out at two different levels: image-level (detec- 
tion) or pixel-level (localization). In the first case, a global integrity score is provided, 
which indicates the likelihood that a manipulation occurred anywhere in the image. 
The score can be a real number, typically an estimate of the probability of manipula- 
tion, or a binary decision (yes/no). Localization instead builds upon the hypothesis 
that a detector has already decided on the presence of a manipulation and tries to 
establish for each individual pixel if it has been manipulated. Therefore, the output 
has the same size of the image and, again, can be real valued, corresponding to a 
probability map or more in general a heat map, or binary, corresponding to a decision 
map (see Fig. 11.8). It is worth noting that most localization methods in practice also 
try to detect the presence of a manipulation and do not make any assumptions on 
the fake/pristine nature of the image. Many methods focus on localization, others 
on detection or on both tasks. A brief summary of the main characteristics of these 
approaches is presented in Table 11.1. In this section, we analyze the major direc- 
tions adopted in the literature to address these problems by using deep-learning-based 
solutions. 
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Fig. 11.8 The output of a method for digital integrity verification can be either a (binary or con- 
tinuous) global score related to the probability that the image has been tampered or a localization 
map, where the score is computed at pixel level 


11.3.1 Patch-Based Localization 


The images to analyze are usually much larger than the input layer of neural networks. 
To avoid resizing the input image, which may destroy precious evidence, several 
approaches propose to work on small patches, with size typically spanning from 
64x 64 to 256 x 256 pixels, and analyze the whole image patch by patch. Localization 
can be performed in different ways based on the learning strategy, as described in 
the following. 


Two-class Learning 


Supervised learning relies on the availability of a large dataset of both real and manip- 
ulated patches. Once the patch is classified at test time, localization is carried out by 
performing a sliding-window analysis of the whole image. Most of the early methods 
are designed in this way. For example, to detect traces of double compression, in Park 
et al. (2018) a dataset of single-compressed and double-compressed JPEG blocks was 
generated using a very large number of quantization tables. Clearly, the more diverse 
and varied the dataset is, the more effective the training will be, with strong impact on 
the final performance. Beyond double compression, one can be interested in detect- 
ing traces of other types of processing, like blurring or resampling. In this case, one 
has only to generate a large number of patches of such kind. The main issue is the 
variety of parameters one can consider and that does not allow to easily cover the 
whole space. Patch-based learning is also used to detect object insertion in an image, 
based on the presence of boundary artifacts. In this case, manipulated patches include 
pixels coming from two different sources (the host image and the alien object), with 
a simple edge separating the two regions (Zhang et al. 2018; Liu et al. 2018). 
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One-class learning 


Several methods in forensics look at the problem using an anomaly detection per- 
spective, developing a model of the pristine data and looking for anomalies with 
respect to this model, which can suggest the presence of a manipulation. The idea 
is to extract a specific type of forensic feature during training, whose distribution 
depends strongly on the source image or class of images. Then, at test time, these 
features are extracted from all patches of the image (with suitable stride) and clus- 
tered based on their statistics. If two distinct clusters emerge, the smallest one is 
regarded as anomalous, likely due to a local manipulation, and the corresponding 
patches provide a map of the tampered area. The most successful features used for 
this task are low level, like those related to the CFA pattern (Bammey et al. 2020), 
compression traces (Niu et al. 2021), or all the in-camera processing operations 
(Cozzolino and Verdoliva 2016; Bondi et al. 2017; Cozzolino and Verdoliva 2020). 
For example, some features allow one to classify different camera models with high 
reliability. In the presence of a splicing, the features observed in the pristine and the 
forged regions will be markedly different, indicating that different image parts were 
acquired by different camera models (Bondi et al. 2017; Yao et al. 2020). 

Autoencoders can be very useful to implement a one-class approach. In fact, a 
network trained on a large number of patches extracted from untampered images 
eventually learns to reproduce pristine images with good accuracy. On the contrary, 
anomalies will give rise to large reconstruction errors, pointing to a possible manipu- 
lation. Noise inconsistencies are used in Cozzolino and Verdoliva (2016) to localize 
splicings without requiring a training step. At test time, handcrafted features are 
extracted from the patches of the image and an iterative algorithm based on feature 
labeling is used to detect the anomaly region. The same approach has been extended 
to videos in D’ Avino et al. (2017), using an LSTM recurrent network to account for 
temporal dependencies. 

In Bammey et al. (2020), instead, the idea is to learn locally the possible configu- 
rations of a CFA mosaic. Whenever such pattern is not found in the sliding-window 
analysis at test time, an anomaly is identified, and then a possible manipulation. A 
similar approach can be followed with reference to the JPEG quality factor: in this 
case, if an inconsistency is found in the image, this means that the anomalous region 
was compressed with a different compression level than the rest of the image, a strong 
evidence of local modifications (Niu et al. 2021). 


11.3.2 Image-Based Localization 


Majority of approaches that carry out localization rely on fully convolutional net- 
works and leverage architectures developed originally for image segmentation, 
regarding the problem itself as a special form of segmentation between pristine and 
manipulated regions. A few methods, instead, take inspiration from object detection 
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and make decisions on regions extracted from a preliminary region proposal phase. 
In the following, we review the main ideas. 


Segmentation-like approach 


These methods take as input the whole image and produce a localization map of 
the same size. Therefore, they avoid the problem of post-processing a large number 
of patch-level results in some sensible way. On the down side, they need a large 
collection of fake and real images for training, as well as their associated pixel-level 
ground truth, two rather challenging requirements. Actually, many methods keep 
being trained on patches and keep working on patches at testing time, but all the 
pieces of information are processed jointly in the network to provide an image-level 
output, just as it happens with denoising networks. Therefore, for these methods, 
training can be performed at patch level by considering many different types of local 
manipulations (Wu et al. 2019; Liu et al. 2021) or including boundaries between the 
pristine and forged area (Salloum et al. 2018; Rao et al. 2021; Zhang and Ni 2020). 
Many other solutions, instead, use the information from the whole image by resizing 
it to a fixed dimension (Wu et al. 2018; Bappy et al. 2019; Mazaheri et al. 2019; Bi 
et al. 2019, 2020; Kniaz et al. 2019; Shi et al. 2020; Chen et al. 2020; Hu et al. 2020; 
Islam et al. 2020; Chen et al. 2021). The image is reduced to the size of the net- 
work input, and therefore fine-grained pixel-level dependencies typical of in-camera 
and out-camera artifacts are destroyed. Moreover, local manipulations may become 
extremely small after resizing and easily neglected in the overall loss. This approach 
is conceptually simple but causes a significant loss of information. To increase the 
number of training samples and generalize across a large variety of possible manip- 
ulations, in Zhou et al. (2019) an adversarial strategy is also adopted. Even the loss 
functions used in these works are typically inspired by semantic segmentation, such 
as pixel-wise cross-entropy (Bi et al. 2019, 2020; Hu et al. 2020), weighted cross- 
entropy (Bappy et al. 2019; Rao et al. 2021; Salloum et al. 2018; Zhang and Ni 2020), 
and dice (Chen et al. 2021). Since often the manipulation region is much smaller than 
the whole image in Li and Huang (2019), it is proposed to adopt the focal loss, which 
assigns a modulating factor to the cross-entropy term and then can address the class 
imbalance problem. 


Object Detection-Like Approach 


A different strategy is to identify only the region box that includes the manipulation 
adopting approaches that are applied for the object detection task. This path is first 
followed in Zhou et al. (2018) relying on the Region Proposal Network (RPN) (Ren 
et al. 2017) very popular for object detection. This is adapted to provide regions with 
potential manipulations, by using features extracted both from the RGB channels and 
from the noise residual as input. The approach proposed in Yang et al. (2020), instead, 
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adopts Mask R-CNN (Re et al. 2020) and includes an attention region proposal 
network to identify the manipulated region. 


11.3.3 Detection 


Despite the obvious Importance of detection, which should take place before local- 
ization is attempted, in the literature there has been limited attention to this specific 
problem. A large number of localization methods have been used to perform detection 
through a suitable post-processing of the localization heatmap aimed at extracting a 
global score. In some cases, the average or the maximum value of the heatmap is used 
as decision statistics (Huh et al. 2018; Wu et al. 2019; Rao et al. 2021). However, 
localization methods often perform clustering or segmentation in the target image 
and therefore they tend to find a forged area also in pristine images, generating a 
large number of false alarms. 

Other methods propose ad hoc information fusion strategies. Specifically, in Rao 
and Ni (2016); Boroumand and Fridrich (2018), training is carried out on a patch- 
level analysis then statistics-based information is aggregated for the whole image to 
form a feature vector as the input of a classifier, either a support vector machine or 
a neural network. Unfortunately, a patch-level analysis does not allow taking into 
account both local information (through textural analyses) and global information 
over the whole image (through contextual analyses) at the same time. That is, by 
focusing on a local analysis, the big picture may go lost. 

Based on these considerations, a different direction is followed in Marra et al. 
(2019). The network takes as input the whole image without any resizing, so as not 
to lose precious traces of manipulation hidden in its fine-grain structure. All patches 
are analyzed jointly, through a suitable aggregation, to make the final image-level 
decision on the presence of a manipulation. More important, also in the training phase, 
only image-level labels are used to update the network weights, and only image-level 
information back-propagates until the early patch-level feature extraction layers. To 
this end, a gradient checkpointing technique is used, which allows to keep in memory 
all relevant variables at the cost of a limited increase in computation. This approach 
allows for the joint optimization of patch-level feature extraction and image-level 
decision. The first layers extract features that are instrumental to carry out a global 
analysis and the last layers exploit this information to highlight local anomalies that 
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Image Activation map ROI-based localization 


Fig. 11.9 Example of localization analysis based on the detection approach proposed in Marra et al. 
(2019). From left to right: test image, corresponding activation map, and ROI-based localization 
result with ground-truth box (green) and automatic ones (magenta). Detection scores are shown on 
the corner of each box 


would not appear at the patch level. In Marra et al. (2019), this end-to-end approach 
is shown to also allow a good interpretation of results by means of activation maps 
and a good localization of forgeries (see Fig. 11.9). 


11.4 Architectural Solutions 


In this section, we describe the most popular and interesting deep-learning-based 
architectural solutions proposed in digital forensics. 


11.4.1 Constrained Networks 


To exploit low-level features related to camera-based artifacts, one needs to suppress 
the scene content. This can be done by pre-processing the input image, through a 
preliminary denoising step, or by means of suitable high-pass filters, like the popular 
spatial rich models (SRM) initially proposed in the context of steganalysis (Fridrich 
and Kodovsky 2012) and applied successfully in image forensics (Cozzolino et al. 
2014). Several CNN architectures try to include this pre-processing in their archi- 
tecture by means of a constrained first layer which performs the desired high-pass 
filtering. For example, in Rao and Ni (2016), a set of fixed high-pass filters inspired 
to the spatial rich model are used to compute residuals of the input patches, while 
in Bayar and Stamm (2016) the high-pass filters of the first convolutional layer are 
learnt using the following constraints: 


wx (0, 0) = —1 
mG = 0 ee 
ij 
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where ux (i, j) is the weight of k-th filter at the (i, j) position and (0, 0) indicates 
the central position. This solution is pursued also in other works, such as (Wu et al. 
2019; Yang et al. 2020; Chen et al. 2021), while many others use a fixed filtering 
(Li and Huang 2019; Barni et al. 2017; Zhou et al. 2018) or adopt a combination of 
fixed high-pass filters and trainable ones (Wu et al. 2019; Zhou et al. 2018; Bi et al. 
2020; Zhang and Ni 2020). 

A slightly different perspective is considered in Cozzolino et al. (2017) where a 
CNN architecture is built to replicate exactly the co-occurrence-based methods of 
Fridrich and Kodovsky (2012). All the phases of this procedure, i.e., extraction of 
noise residuals, scalar quantization, computation of co-occurrences, and histogram 
are implemented using standard convolutional or pooling layers. Once established 
the equivalence, the network is modified to enable its fine-tuning and further improve 
performance. Since the resulting network has a lightweight structure, fine-tuning can 
be carried out using a small training set, limiting computation time. 


11.4.2 Two-Branch Networks 


In order to exploit different types of clues at the same time, it is possible to adopt 
a CNN architecture that accepts multiple inputs (Zhou et al., 2018; Shi et al., 2018; 
Bi et al., 2020; Kwon et al., 2021). For example, a two-branch approach is proposed 
in Zhou et al. (2018) in order to take into account both low-level features extracted 
from noise residuals and high-level features extracted from RGB images. In Amerini 
et al. (2017); Kwon et al. (2021), instead, information extracted from the spatial 
and DCT domain is analyzed so as to take explicitly into account also compression 
artifacts. Two-branch solutions can also be used to produce multiple outputs, as 
done in Salloum et al. (2018), where the fully convolutional network has two output 
branches: one provides the localization map of splicing while the other provides a 
map containing only the edges of the forged region. In the context of copy-move 
detection, the two-branch structure is used to address separately the problems of 
detecting duplicates and disambiguating between the original object an its clones 
(Wt et al. 2018; Barni et al. 2019). 


11.4.3 Fully Convolutional Networks 


Fully convolution networks (FCNs) can be extremely useful since they preserve 
spatial dependencies and generate an output of the same size of the input image. 
Therefore, the network can be trained to output a heat map which allows to perform 
tampering localization. Various architectures have shown to be particularly useful 
for forensics applications. 

Some methods (Bi et al. 2019; Shi et al. 2020; Zhang and Ni 2020) rely on 
a plain U-Net architecture (Ronneberger et al. 2015), one of the most successful 
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network for semantic segmentation, initially proposed for biomedical applications. 
A more innovative variation of FCNs is proposed in Bappy et al. (2019); Mazaheri 
et al. (2019) where the architecture includes a long short-term memory (LSTM) 
module. The blocks of the image are sequentially processed by the LSTM. In this way, 
the network can model the spatial relationships between neighboring blocks, which 
facilitates detecting manipulated blocks as those that break the natural statistics of 
authentic ones. Other papers include spatial pyramid mechanisms in the architectures, 
in order to capture relationships at different scales of the image (Bi et al. 2020; Wu 
et al. 2019; Hu et al. 2020; Liu et al. 2021; Chen et al. 2021). 

Both LSTM modules and spatial pyramids aim at overcoming the intrinsic limits 
of networks based only on convolutions where, due to the limited overall receptive 
field, the decision on a pixel depends only on the information in a small region around 
the pixel. With the same aim, it is also possible to use attention mechanisms (Chen 
et al. 2008, 2021; Liu et al. 2021; Rao et al. 2021), as in Shi et al. (2020), where 
channel attention is used in the Gram matrix to measure the correlation between any 
two feature maps. 


11.4.4 Siamese Networks 


Siamese networks can be extremely useful for several forensic purposes. In fact, 
by leveraging the Siamese configuration, with two identical networks working in 
parallel, one can focus on the similarity and differences among patches, making up 
for the absence of ground-truth data. 

This approach is applied in Mayer and Stamm (2018), where a constrained network 
is used to extract camera-model features, while another network is trained to learn the 
similarity between pairs of such features. This strategy is further developed in Mayer 
and Stamm (2019) where a graph-based representation is introduced to better identify 
the forensic relationships among all patches within an image. A Siamese network 
is also used in Huh et al. (2018) to establish if image patches were captured with 
different imaging pipelines. The proposed approach is self-supervised and localizes 
image splicings by predicting the consistency of EXIF attributes between pairs of 
patches in order to establish whether they came from a single coherent image. Once 
trained on pristine images, featured by their EXIF header, the network can be used on 
any new image without further supervision. As already seen in Sect. 1.2.1, a Siamese 
approach can help to extract a camera-model fingerprint, where artifacts related to 
camera model are emphasized (Cozzolino and Verdoliva 2020, 2018; Cozzolino et al. 
2019). 
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11.5 Datasets 


For two-class learning-based approaches, having good data for training is of 
paramount importance. By “good”, we mean abundant (for data-hungry modern 
CNNs), representative (of the many possible manipulations encountered in the wild) 
and well curated (e.g., balanced, free from biases). Moreover, to assess the perfor- 
mance of new proposals, it is important to compare results on multiple datasets with 
different features. The research community has made considerable efforts through 
the years to release a number of datasets with a wide array of image manipulations. 
However, not all of them possess the right properties to support, by themselves, the 
development of learning-based methods. In fact, a way too popular approach is to 
split a single dataset in training, validation, and test set, carrying out training and 
experiments on this single source, a practice that may easily induce some forms of 
polarization or overfitting, if the dataset is not built with great care. 

In Fig. 11.10, we show a list of the most widespread datasets that have been pro- 
posed since 2011. During this short time lapse, the size of the datasets has rapidly 
increased, by roughly two orders of magnitude, together with the variety of manip- 
ulations (see Table 11.2). although the capacity to fool an observer has not grown at 
a similar pace. In this section, the most widespread datasets are described, and their 
features are briefly discussed. 

The Columbia dataset (Hsu and Chang 2006) is one of the first dataset proposed 
in the literature. It has been extensively used, however it includes only unrealistic 
forgeries without any semantic meaning. A more realistic dataset with splicings is 
DSO-1 (Carvalho et al. 2013) that in turn has the limitation that all images have the 
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Fig. 11.10 An overview of the most used datasets in the current literature for image manipulation 
detection and localization. We can observe that more recent ones are much larger, while the level 


of realism has not really improved upon time. Note that the size of the circles corresponds to the 
number of samples in the dataset 
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Table 11.2 List of datasets including generic image manipulations. Some of them are targeted for 
specific manipulations, while others are more various and contain different types of forgeries 


Dataset Refs. manipulations # prist. # forged 
Columbia gray Ng and Chang (2004) | Splicing (unrealistic) | 933 912 
Columbia color Hsu and Chang (2006) | Splicing (unrealistic) | 182 180 
CASIA v1 Dong et al. (2013) Splicing, copy-move | 800 921 
CASIA v2 Dong et al. (2013) Splicing, copy-move | 7,200 5;123 
MICC F220 Amerini et al. (2011) | Copy-move 110 110 
MICC F2000 Amerini et al. (2011) | Copy-move 1,300 700 
VIPP Bianchi and Piva Splicing 68 69 
(2012) 
FAU Christlein et al. (2012) | Copy-move 48 48 
DSO-1 Carvalho et al. (2013) | Splicing 100 100 
CoMoFoD Tralic et al. (2013) Copy-move 260 260 
Wild Web Zampoglou et al. Real-world cases 90 9,657 
(2015) 
GRIP Cozzolino et al. Copy-move 80 80 
(2015) 
RTD (Korus) Korus and Huang Various 220 220 
(2016) 
COVERAGE Wen et al. (2016) Copy-move 100 100 
NC2016 Guan et al. (2019) Various 560 564 
NC2017 Guan et al. (2019) Various 2,667 1,410 
FaceSwap Zhou et al. (2017) Face swapping 1,758 1,927 
MFC2018 Guan et al. (2019) Various 14,156 3,265 
PS-Battles Heller et al. (2018) Various 11,142 102,028 
MFC2019 MFC (2019) Various 10,279 5,750 
DEFACTO Mahfoudi et al. (2019) | Various _ 229,000 
FantasticReality Kniaz et al. (2019) Splicing 16,592 19,423 
SMIFD-500 Rahman et al. (2019) | Various 250 250 
IMD2020 Novozamsky et al. Various 414 2,010 
(2020) 


same resolution, they are in uncompressed format and there is no information on how 
many cameras were used. A very large dataset of face swapping has been introduced 
in Zhou et al. (2017), which contains around 4,000 real and manipulated images using 
two different algorithms. Several datasets have been designed for copy-move forgery 
detection (Christlein et al. 2012; Amerini et al. 2011; Tralic et al. 2013; Cozzolino 
et al. 2015; Wen et al. 2016), one of the most studied forms of manipulation. Some 
of them modify the copied object in multiple ways, including rotation, resizing, and 
change of illumination, to stress the capabilities of copy-move methods. 
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Many other datasets include various types of manipulations so as to present a more 
realistic scenario. The CASIA dataset (Dong et al. 2013) includes both splicings and 
copy-moves and inserted objects are post-processed to better fit the scene. Nonethe- 
less, it exhibits a strong polarization, as highlighted in Cattaneo and G. Roscigno 
(2014): authentic images can be easily separated from tampered ones since they were 
compressed with different quality factors. Forgeries of various nature are present also 
in the Realistic Tampering Dataset proposed in Korus and Huang (2016) that com- 
prises all uncompressed, but very realistic manipulations. The Wild Web Dataset is 
instead a collection of cases from the Internet (Zampoglou et al. 2017). 

The U.S. National Institute of Standards and Technology (NIST) has released 
several large datasets (Guan et al. 2019) aimed at testing forensic algorithms in 
increasingly realistic and challenging conditions. The first one, NC2016, is quite 
limited and present some redundancies, such as same images repeated multiple times 
in different conditions. While this can be of interest to study the behavior of some 
algorithms, it prevents the dataset to be split for training, validation, and test to 
avoid overfitting. Much more challenging and large datasets have been proposed by 
NIST in subsequent years: NC2017, MFC2018, and MFC2019. These datasets are 
very large and present very different types of manipulations, resolutions, formats, 
compression levels, and acquisition devices. 

Some very large datasets have been released recently. DEFACTO (Mahfoudi et al. 
2019) comprises over 200,000 images with a wide variety of realistic manipulations, 
both conventional, like splicings, copy-moves, and removals, and more advanced, 
like face morphing. The PS-Battles Dataset, instead, provides for each original image 
a varying number of manipulated versions (Heller et al. 2018), for a grand total of 
102,028 images. However, it does not provide ground truths and the original them- 
selves are not always pristine. SMIFD-500 (Social Media Image Forgery Detection 
Database) Rahman et al. (2019) is a set of 500 images collected from popular social 
media, while FantasticReality (Kniaz et al. 2019) is a dataset of splicing manipula- 
tions split into two parts: “Rough” and “Realistic”. The “Rough” part contains 8k 
splicings with obvious artifacts while the “Realistic” part provides 8k splicings that 
were retouched manually to obtain a realistic output. 

A dataset of manipulated images has been also built in Novozamsky et al. (2020). 
It features 35,000 images collected from 2,322 different camera models, and manip- 
ulated in various ways, e.g., copy-paste, splicing, retouching, and inpainting. More- 
over, it also includes a set of 2,000 realistic fake images downloaded from the Internet 
along with the corresponding real images. 

Since some methods work on anomaly detection and are trained only on pristine 
data, well-curated datasets comprising a large number of real images are also of 
interest. The Dresden image database (Gloe and Böhme 2010) contains over 14,000 
JPEG images from 73 digital cameras of 25 different models. To allow studying 
the peculiarities of different models and devices, the same scene is captured by a 
large number of cameras. Raw images can be found both in Dresden and in the 
RAISE dataset (Dang-Nguyen et al. 2015), composed of 8,156 images taken by 3 
different camera models. A dataset comprising both SDR (standard dynamic range) 
and HDR (high dynamic range) images has been presented in Shaya et al. (2018). 
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A total of 5,415 images were captured by 23 mobile devices in different scenarios 
under controlled acquisition conditions. The VISION dataset, instead, Shullani et al. 
(2017) includes 34,427 images and 1,914 videos from 35 devices of 11 major brands. 
They appear both in their original format and after uploading/downloading on various 
platforms (Facebook, YouTube, WhatsApp) so as to allow studies on data retrieved 
from social networks. Another mixed dataset, SOCRATES, is proposed in Galdi et al. 
(2017). It contains 6,200 images and 680 videos captured using 67 smartphones of 
14 brands and 42 models. 


11.6 Major Challenges 


In this section, we want to highlight some major challenges of current deep-learning- 
based approaches. 


e Localization versus Detection. To limit computational complexity and memory 
storage, deep networks work on relatively small input patches. Therefore, to pro- 
cess large images, one must either resize them or analyze them patch-wise. The 
first solution does not fit forensic needs, as it destroys precious statistical evi- 
dence related to local textures. On the other hand, a patch-wise analysis may miss 
image-level phenomena. A feature extractor trained on patches can only learn good 
features for local decisions, which are not necessarily good to make image-level 
decisions. Most deep-learning-based methods sweep under the rug the detection 
problem and focus on localization. Then, detection is addressed as an afterthought 
through some forms of processing of the localization map. Not surprisingly, the 
resulting detection performance is often very poor. 

Table 11.3 compares the localization and detection performances of some patch- 
wise methods. We present results in terms of Area Under the Curve (AUC) for 
detection and the Matthews Correlation Coefficient (MCC) for localization. MCC 
is robust to unbalanced classes (the forgery is often small compared to the whole 
image) its absolute value belongs to the range [0, 1]. It represents the cross- 
correlation coefficient between the decision map and the ground truth, computed 


as 
TPxTN-—FPXFN 


MCC = 
V(TP+FP)\TP+FN)\(TN+FP)(TN + FN) 


where 


— TP (true positive): # positive pixels declared positive; 
— TN (true negative): # negative pixels declared negative; 
— FP (false positive): # negative pixels declared positive; 
— FN (false negative): # positive pixels declared negative; 


We can note that, while the localization results are relatively good, detection results 
are much worse and often no better than coin flipping. A more intense effort on 
the detection problem seems necessary. 
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Table11.3 Localization by means of MCC and detection by means of AUC performance compared. 
With a few exceptions, the detection performance is very poor (AUC = 0.5 amounts to coin flipping) 


MCC/AUC DS0-1 VIPP RTD PS-Battles 
100/100 62/62 220/220 80/80 

RRU-Net Bi et al. (2019) 0.250/0.522 0.288/0.538 0.240/0.524 0.303/0.583 
Const. R-CNN Yang et al. (2020) 0.355/0.575 0.382/0.480 0.370/0.522 0.423/0.517 
EXIF-SC Huh et al. (2018) 0.529/0.763 0.402/0.633 0.278/0.541 0.345/0.567 
SPAN Hu et al. (2020) 0.318/0.697 0.336/0.556 0.260/0.564 0.320/0.583 
Adaptive-CFA Bammey et al. (2020) 0.121/0.380 0.130/0.498 0.127/0.472 0.094/0.526 
Noiseprint Cozzolino and Verdoliva (2020) 0.758/0.865 0.532/0.568 0.345/0.531 0.476/0.495 


e Robustness. Forensic analyses rely typically on the weak traces associated with 
the digital history of the target image/video. Such traces are further weakened and 
tend to disappear altogether in the presence of other quality-impairing processing 
steps. As an example, consider the image of Fig. 11.11, with a large splicing, 
localized with high accuracy. After compression or resizing, the localization map 
becomes more fuzzy, and decision unreliable. Likewise, in the images of Fig. 11.12 
with copy-moves, the attribution of the two copies becomes incorrect after heavy 
resizing. On the other hand, most images and videos of interest are found on 
social networks, where they are routinely heavily resized and recompressed for 
storage efficiency, with non-deterministic parameters. Therefore, robustness is a 
central problem that all multimedia forensic methods should confront with from 
the beginning. 

e Generalization. Carrying out a meaningful training is a hard task in image foren- 
sics, in fact, it is very easy to introduce some sort of bias. For example, one may 
train and test the network on images acquired only by a limited number of devices, 
identifying the provenance source instead of the artifact trace, or may use differ- 
ent JPEG compression pipelines for real and fake images, giving rise to different 
digital histories. A network will no doubt learn all these features if they help dis- 
criminating between pristine and tampered images, and may neglect more general 
but weaker traces of manipulation. This will lead to a very good performance on 
a training-aligned test set, and very bad performance on independent tests. There- 
fore, it is important to avoid biases to the extent possible, and even more important 
to test the trained networks on multiple and independent test sets. 

e Semantic Detection. Developing further on detection, a major problem is telling 
apart innocent manipulations from malicious ones. An image may be manipulated 
for many reasons, very frequently only for improving its appearance or to better 
fit it in the context of interest. These images should not be detected, as they are 
not really of interest and their number would easily overwhelm any subsequent 
analysis system. A good forensic algorithm should be able to single out only 
malicious attacks, based on the type of manipulation, often concerning only a 
small part of the image, and especially based on the context, including all other 
related media and textual information. 
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Image JPEG 90 JPEG 80 JPEG 70 JPEG 60 
Result Resizing 90% Resizing 75% Resizing 50% Resizing 40% 


Fig. 11.11 Results of splicing localization in the presence of different levels of compression and 
resizing. MCC is 0.998, it decreases to 0.985, 0.679, 0.412, and 0.419 by varying the compression 
level: JPEG quality equal to 90, 80, 70, and 60, respectively. MCC decreases to 0.994, 0.991, 0.948, 
and 0.517 by varying the resizing factor: 0.90, 0.75, 0.50, and 0.40, respectively 


Forged Image Ground truth Resizing 90% Resizing 75% Resizing 50% 
i t' i i: 
a a bee) 
B B B 


Fig. 11.12 Results of copy-move detection and attribution in the presence of resizing 


11.7 Conclusions and Future Directions 


Manipulating images and videos is becoming simpler and simpler, and fake media are 
becoming widespread, with the well-known associated risks. Therefore, the develop- 
ment of reliable multimedia forensic tools is more urgent than ever. In recent years, the 
main focus of research has been on deepfakes, images, and videos (often of persons) 
fully generated by means of AI tools. Yet, detecting computer-generated material 
seems to be relatively simple, for the time being. These images and videos lack some 
characteristic features of real media assets (think of device and model fingerprints) 
and exhibit instead traces of their own generation process (e.g., GAN fingerprints). 
Moreover, they can be generated in large number, providing for a virtually unlimited 
training set for deep learning methods to learn from. 

Cheapfakes, instead, though crafted with conventional methods, keep evading 
more easily the scrutiny of forensic tools. Also for them, detectors based on deep 
learning can be expected to provide improved performance. However, creating large 
unbiased datasets with manipulated images representative of the wide variety of 
attacks is not simple. In addition, the manipulations themselves rely on real (as 
opposed to generated) data, hardly detected as anomalous. Finally, manipulated 
images are often resized and recompressed, further weakening the feeble forensic 
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Splicing Copy-Move Inpainting 


Original Image 


Forged Image 


Fig. 11.13 Examples of conventional manipulations carried out using deep learning methods: 
splicing (Tan et al. 2018), copy-move (Thies et al. 2019), inpainting (Zhu et al. 2021) 


traces a detector can rely on. All this said, the current state of deep-learning-based 
methods for cheapfake detection can be considered promising, and further improve- 
ments will certainly come. 

However, new challenges loom on the horizon. The acquisition and processing 
technology evolves rapidly, deleting or weakening well-known forensic traces, just 
think of the effects of video stabilization and new shooting modes on PRNU-based 
methods (Mandelli et al. 2020; Baracchi et al. 2020). Of course, new traces will 
appear, but research may struggle keeping this fast pace. Likewise, also manipula- 
tion methods evolve, see Fig. 11.13. For example, the inpainting process has been 
significantly improved thanks to deep learning, with recent methods which fill image 
regions using newly generated semantically consistent content. In addition, methods 
have been already proposed to conceal the traces left by various types of attacks. 
Along this line, another peculiar problem of deep-learning-based methods is their 
vulnerability to adversarial attacks. The class (pristine/fake) of an asset can be easily 
changed by means of a number of simple methods, without significantly impairing 
its visual quality. 

As a final remark, we underline the need for interpretable tools. Deep learning 
methods, despite their excellent performance, act as black boxes, providing little or 
no hints on why certain decisions are made. Of course, such decisions may be easily 
questioned and may hardly stand in a court of justice. Improving explainability, 
though largely unexplored, is therefore a major topic for current and future work in 
this field. 
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Chapter 12 A) 
DeepFake Detection cilo; 


Siwei Lyu 


One particular disconcerting form of disinformation are the impersonating audios/ 
videos backed by advanced AI technologies, in particular, deep neural networks 
(DNNSs). These media forgeries are commonly known as the DeepFakes. The Al- 
based tools are making it easier and faster than ever to create compelling fakes that 
are challenging to spot. While there are interesting and creative applications of this 
technology, it can be weaponized to cause negative consequences. In this chapter, we 
survey the state-of-the-art DeepFake detection methods. We introduce the technical 
challenges in DeepFake detection and how researchers formulate solutions to tackle 
this problem. We discuss the pros and cons, as well as the potential pitfalls and 
drawbacks of each types of the solutions. Notwithstanding this progress, there are 
a number of critical problems that are yet to be resolved for existing DeepFake 
detection methods. We will also highlight a few of these challenges and discuss the 
research opportunities in this direction. 


12.1 Introduction 


Falsified images and videos created by AI algorithms, more commonly known as 
DeepFakes, are a recent twist to the disconcerting problem of online disinformation. 
Although fabrication and manipulation of digital images and videos are not new 
(Farid 2012), the rapid developments of deep neural networks (DNNs) in recent years 
have made the process of creating convincing fake images/videos increasingly easier 
and faster. DeepFake videos first caught the public’s attention in late 2017, when 
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Fig.12.1 Examples of DeepFake videos: (top) Head puppetry, (middle) face swapping, and (bot- 
tom) lip syncing 


a Reddit account with the same name began posting synthetic pornographic videos 
generated using a DNN-based face-swapping algorithm. Subsequently, technologies 
that make DeepFakes have been mainstreamed through readily available software 
freely available on GitHub.' There are also emerging online services” and start-up 
companies also commercialized tools that that can generate DeepFake videos on 
demand.* 

Currently, there are three major types of DeepFake videos. 


e Head puppetry entails synthesizing a video of a target person’s whole head and 
upper shoulder using a video of a source person’s head, so the synthesized target 
appears to behave the same way as the source. 

e Face swapping involves generating a video of the target with the faces replaced 
by synthesized faces of the source while keeping the same facial expressions. 

e Lip syncing is to create a falsified video by only manipulating the lip region so 
that the target appears to speak something that s/he does not speak in reality. 


Figure 12.1 shows some example frames of each type of DeepFake videos aforemen- 
tioned. 

While there are interesting and creative applications of DeepFake videos, due 
to the strong association of faces to the identity of an individual, they can also be 


1E, g., FakeApp (FakeApp 2020), DFaker (DFaker github 2019), faceswap-GAN (faceswap- 
GAN github 2019), faceswap (faceswap github 2019), and DeepFaceLab (DeepFaceLab github 
2020). 


2 E.g., https://deepfakesweb.com. 
3 E.g., Synthesia (https://www.synthesia.io/) and Canny AI https://www.cannyai.com/. 
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weaponized. Well-crafted DeepFake videos can create illusions of a person's pres- 
ence and activities that do not occur in reality, which can lead to serious political, 
social, financial, and legal consequences (Chesney and Citron 2019). The potential 
threats range from revenge pornographic videos of a victim whose face is synthe- 
sized and spliced in, to realistically looking videos of state leaders seeming to make 
inflammatory comments they never actually made, a high-level executive comment- 
ing about her company’s performance to influence the global stock market, or an 
online sex predator masquerades visually as a family member or a friend in a video 
chat. 

Since the first known case of DeepFake videos were reported in December 2017,* 
the mounting concerns over the negative impacts of DeepFakes have spawned an 
increasing interest in DeepFake detection in the Multimedia Forensics research com- 
munity, and the first dedicated DeepFake detection algorithm was developed by my 
group in June 2018 (Li et al. 2018). Subsequently, there are avid developments 
with many DeepFake detection methods developed in the past few years, e.g., Li 
et al. (2018, 2019a, b), Li and Lyu (2019) Yang et al. (2019), Matern et al. (2019), 
Ciftci et al. (2020), Afchar et al. (2018), Giiera and Delp (2018a, b), McCloskey and 
Albright (2018), Sabir et al. (2019), Rossler et al. (2019), Nguyen et al. (2019a, b, c), 
Nataraj et al. (2019), Xuan et al. (2019), Jeon et al. (2020), Mo et al. (2018), Liu et al. 
(2020), Fernando et al. (2019), Shruti et al. (2020), Koopman et al. (2018), Amerini 
et al. (2019), Chen and Yang (2020), Wang et al. (2019a), Guarnera et al. (2020), 
Durall et al. (2020), Frank et al. (2020), Ciftci and Demir (2019), Chen and Yang 
(2020), Stehouwer et al. (2019), Bonettini et al. (2020), Khalid and Woo (2020). Cor- 
respondingly, there have also been several large-scale benchmark datasets to evaluate 
DeepFake detection performance (Yang et al. 2019; Korshunov and Marcel 2018; 
R6ssler et al. 2019; Dufour et al. 2019; Dolhansky et al. 2019; Ciftci and Demir 
2019; Jiang et al. 2020; Wang et al. 2019b; Khodabakhsh et al. 2018; Stehouwer 
et al. 2019). DeepFake detection has also been supported by government funding 
agencies and private companies alike.° 

A climax of these efforts is the first DeepFake Detection Challenge from late 2019 
to early 2020 (Deepfake detection challenge 2021). Overall, the DeepFake Detection 
Challenge corresponds to the state-of-the-art, with the winning solutions being a tour 
de force of advanced DNNs (an average precision of 82.56% by the top performer). 
These provide us effective tools to expose DeepFakes that are automated and mass 
produced by AI algorithms. However, we need to be cautious in reading these results. 
Although the organizers have made their best effort to simulate situations that Deep- 
Fake videos are deployed in real life, there is still a significant discrepancy between 
the performance on the evaluation dataset and a more real dataset—when tested on 
unseen videos, the top performer’s accuracy reduced to 65.18%. In addition, all solu- 


4 https://www.vice.com/en/article/gydydm/gal-gadot-fake-ai-porn. 

5 Detection of DeepFakes is one of the goals of the DARPA MediFor (2016-2020), which also spon- 
sored the NIST MFC 2018 and 2020 Synthetic Data Detection Challenge (NIST MFC 2018), and 
SemaFor (2020-2024) programs. The Global DeepFake Detection Challenge (Deepfake detection 
challenge 2021) in 2020 was sponsored by leading tech companies including Facebook, Amazon, 
and Microsoft. 
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tions are based on clever designs of DNNSs and data augmentations, but provide little 
insight beyond the “black box”-type classification algorithms. Furthermore, these 
detection results may not completely reflect the actual detection performance of the 
algorithm on a single DeepFake video, especially ones that have been manually pro- 
cessed and perfected after being generated from the AI algorithms. Such “crafted” 
DeepFake videos are more likely to cause real damages, and careful manual post 
processing can reduce or remove artifacts that the detection algorithms predicate on. 

In this chapter, we survey the state-of-the-art DeepFake detection methods. We 
introduce the technical challenges in DeepFake detection and how researchers for- 
mulate solutions to tackle this problem. We discuss the pros and cons, as well as the 
potential pitfalls and drawbacks of each types of the solutions. We also provide an 
overview of research efforts for DeepFake detection and a systematic comparison of 
existing datasets, methods, and performances. Notwithstanding this progress, there 
are a number of critical problems that are yet to be resolved for existing DeepFake 
detection methods. We will also highlight a few of these challenges and discuss the 
research opportunities in this direction. 


12.2 DeepFake Video Generation 


Although in recent years there have been many sophisticated algorithms for generat- 
ing realistic synthetic face videos (Bitouk et al. 2008; Dale et al. 2011; Suwajanakorn 
et al. 2015, 2017; Thies et al. 2016; Korshunova et al. 2017; Pham et al. 2018; Karras 
et al. 2018a, 2019; Kim et al. 2018; Chan et al. 2019), most of these have not been 
in mainstream as open-source software tools that anyone can use. It is a much sim- 
pler method based on the work of neural image style transfer (Liu et al. 2017) that 
becomes the tool of choice to create DeepFake videos in scale, with several indepen- 
dent open-source implementations. We refer to this method as the basic DeepFake 
maker, and it is underneath many DeepFake videos circulated on the Internet or in 
the existing datasets. 

The overall pipeline of the basic DeepFake maker is shown in Fig. 12.2 (left). 
From an input video, faces of the target are detected, from which facial landmarks are 
further extracted. The landmarks are used to align the faces to a standard configuration 
(Kazemi and Sullivan 2014). The aligned faces are then cropped and fed to an auto- 
encoder (Kingma and Welling 2014) to synthesize faces of the donor with the same 
facial expressions as the original target’s faces. 

The auto-encoder is usually formed by two convoluntional neural networks 
(CNNs), i.e., the encoder and the decoder. The encoder E converts the input tar- 
get’s face to a vector known as the code. To ensure the encoder capture identity- 
independent attributes such as facial expressions, there is one single encoder regard- 
less the identities of the subjects. On the other hand, each identity has a dedicated 
decoder D;, which generates a face of the corresponding subject from the code. The 
encoder and decoder are trained in tandem using uncorresponded face sets of mul- 
tiple subjects in an unsupervised manner, Fig. 12.2 (right). Specifically, an encoder- 
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Fig.12.2 Synthesis (left) and training (right) of the basic DeepFake maker algorithm. See texts for 
more details 
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decoder pair is formed alternatively using E and D; for input face of each subject, 
and optimize their parameters to minimize the reconstruction errors (¢; difference 
between the input and reconstructed faces). The parameter update is performed with 
the back-propagation until convergence. 

The synthesized faces are then warped back to the configuration of the original 
target’s faces and trimmed with a mask from the facial landmarks. The last step 
involves smoothing the boundaries between the synthesized regions and the original 
video frames. The whole process is automatic and runs with little manual intervention. 


12.3 Current DeepFake Detection Methods 


In this section, we provide a brief overview of current DeepFake detection meth- 
ods. As this is an actively researched area with the literature growing rapidly, we 
cannot promise comprehensive coverage of all existing methods. Furthermore, as 
the focus here is on DeepFake videos, we exclude detection methods for GAN- 
generated images. Our objective is to summarize the existing detection methods at a 
more abstract level, pointing out some common characteristics and challenges that 
can benefit future developments of more effective detection methods. 


12.3.1 General Principles 


As detection of DeepFakes is a problem in Digital Media Forensics, it complies with 
three general principles of Digital Media Forensics. 


e Principle 1: Manipulation operations leave traces in the falsified media. 
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Digitally manipulated or synthesized media (including DeepFakes) are created 
by a process other than a capturing device recording an event actually occurs 
in the physical world. This fundamental difference in the creation process will 
be reflected in the resulting falsified media, albeit revealed with different scales 
depending on the amount of changes to the original media. Based on this principle, 
we can conclude that DeepFakes are detectable. 

Principle 2: A single forensic measure can be circumvented. 

Any forensic detection method is based on differentiating the real and falsified 
media on certain characteristics in the media signals. However, the same charac- 
teristics can be exploited to evade the very forensic detections, either by hiding the 
traces or disrupting them. This principle is the premise for anti-forensic measures 
for DeepFake detections. 

Principle 3: There is an intention behind every falsified media. 

A falsified media in the wild is made for a reason. It could be a satire or a prank but 
also a malicious attack to the victim’s reputation and credibility. Understanding the 
motivation behind the DeepFake can provide richer information to the detection 
of DeepFakes, as well as to prevent and mitigate the damages. 


12.3.2 Categorization Based on Methodology 


We categorize existing DeepFake detection methods into three types. The first two 
work by seeking specific artifacts in DeepFake videos that differ them from the real 
videos (Principle 1 of Sect. 12.3.1). 


12.3.2.1 Signal Feature-Based Methods 


The signal feature-based methods (e.g., Afchar et al. 2018; Güera and Delp 2018b; 
McCloskey and Albright 2018; Li and Lyu 2019) look for abnormalities at the signal 
level, treating the videos as a sequence of frames f(x, y, t), and a synchronized 
audio signal a(t) if the audio track exists. Such abnormalities are often caused by 
various steps in the processing pipeline of DeepFake generation. 

For instance, our recent work (Li and Lyu 2019) exploits the signal artifacts 
introduced by the resizing and interpolation operations during the post-processing of 
DeepFake generation. Similarly, the work in Li et al. (2019a) focuses on the boundary 
when the synthesized face regions are blended into the original frame. In Frank et al. 
(2020), the authors observe that synthesized faces often exhibit abnormalities in high 
frequencies, due to the up-sampling operation. Based on this observation, a simple 
detection method in the frequency domain is proposed. In Durall et al. (2020), the 
detection method uses a classical frequency domain analysis followed by an SVM 
classifier. The work in Guarnera et al. (2020) uses the EM algorithm to extract a 
set of local features in the frequency domain that are used to distinguish frames of 
DeepFakes from those of the real videos. 
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The main advantage of signal feature-based methods is that the abnormalities 
are usually fundamental to generation process, and fixing them may require signif- 
icant changes to the underlying DNN model or post-processing steps. On the other 
hand, the reliance on signal features also means that signal feature-based DeepFake 
detection methods are susceptible to disturbance to the signals, such as interpola- 
tion (rotation, up-sizing), down-sampling (down-sizing), additive noise, blurring, 
and compression. 


12.3.2.2 Physical/Physiological-Based Methods 


The physical/physiological-based methods (e.g., Li et al. 2018; Yang et al. 2019; 
Matern et al. 2019; Ciftci et al. 2020; Hu et al. 2020) expose DeepFake videos based 
on their violations to the fundamental laws of physics or human physiology. The 
DNN model synthesizing human faces do not have direct knowledge about physio- 
logical traits of human faces or the physical laws of the surrounding environment, 
such information may be incorporated into the model indirectly (and inefficiently) 
through training data. This lack of knowledge can lead to detectable traits that are 
also intuitive to humans. For instance, the first dedicated DeepFake detection method 
(Li et al. 2018) works by detecting the inconsistency or lack of realistic eye blinking 
in DeepFake videos. This is due to the training data for the DeepFake generation 
model, which are often obtained from the Internet as the portrait images of a subject. 
As the portrait images are dominantly those with the subject’s eyes open, the DNN 
generation models trained using them cannot reproduce closed eyes for a realistic 
eye blinking. The work of Yang et al. (2019) use the physical inconsistencies in the 
head poses of the DeepFake videos due to splicing the synthesized face region into 
the original frames. The work of Matern et al. (2019) summarizes various appear- 
ance artifacts that can be observed in DeepFake videos, such as the inconsistent eye 
colors, missing reflections, and fuzzy details in the eye and teeth areas. The authors 
then propose a set of simple features for for DeepFake detection. The work in Ciftci 
et al. (2020) introduces a DeepFake detection method based on biological signals 
extracted from facial regions that are invisible to human eyes in portrait videos, 
such as the slight color changes due to the blood flow caused by heart beats. These 
biological signals are used to reveal the spatial coherence and temporal consistency. 

The main advantage of physical/physiological-based methods is its superior 
intuitiveness and explainability. This is especially important when the detection 
results are used by practitioners such as journalists. On the other hand, physical/ 
physiological-based detection methods are limited by the effectiveness and robust- 
ness of the underlying Computer Vision algorithms, and their strong reliance on the 
semantic cues also limits their applicability. 
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12.3.2.3 Data-Driven Methods 


The third category of detection methods are data-driven methods (e.g., Sabir et al. 
2019; Rössler et al. 2019; Nguyen et al. 2019a, b,c; Nataraj et al. 2019; Xuan et al. 
2019; Wang et al. 2019b; Jeon et al. 2020; Mo et al. 2018; Liu et al. 2020; Li et al. 
2019a, b; Giiera and Delp 2018a; Fernando et al. 2019; Shruti et al. 2020; Koopman 
et al. 2018; Amerini et al. 2019; Chen and Yang 2020; Wang et al. 2019a; Guarnera 
et al. 2020; Durall et al. 2020; Frank et al. 2020; Ciftci and Demir 2019; Chen and 
Yang 2020; Stehouwer et al. 2019; Bonettini et al. 2020; Khalid and Woo 2020), 
which do not target at specific features, but use videos labeled as real or DeepFake 
to train ML models (oftentimes DNNs) that can differentiate DeepFake videos from 
the real ones. Note that feature-based methods can also use DNN classifiers, so what 
makes the data-driven methods different is that the cues for classifying the two types 
of videos are implicit and found by the ML models. As such, the success of data- 
driven methods largely depends on the quality and diversity of training data, as well 
as the design of the ML models. 

Early data-driven DeepFake detection methods reuse standard DNN models (e.g., 
VGG-Net Do et al. 2018, XceptionNet Rossler et al. 2019) designed for other Com- 
puter Vision tasks such as object detection and recognition. There also exist methods 
that use more specific or novel network architectures. MesoNet (Afchar et al. 2018) 
is an early detection method that uses a customized DNN model to detect DeepFakes, 
and interprets the detection results by correlating them with some mesoscopic level 
image features. The DeepFake detection method of Liu et al. (2020) uses the Gram- 
Net to capture differences in global image texture, which has the additional “Gram 
blocks” in the conventional CNN models that can calculate the Gram matrices at dif- 
ferent levels of the network. The detection methods in Nguyen et al. (2019c, b) use 
capsule networks and show that it can achieve similar performance but with fewer 
parameters than the CNN models. 

It should be mentioned that, to date, the state-of-the-art performance in DeepFake 
detection is obtained with the data-drive methods based on large-scale DeepFake 
datasets and innovative model design and training procedures. For instance, the top 
performer of the DeepFake Detection Challenge’ is based on the use of an ensemble 
DNN models that entails seven different EfficientNet models Tan and Le (2019). 
The runner-up method’ also uses ensemble models that includes EfficientNet and 
XceptionNet models but also introduce a new data augmentation methods (WSDAN) 
to anticipate the different configurations of faces. 


Š https://github.com/selimsef/dfdc_deepfake_challenge. 
7 https://github.com/cuihaoleo/kaggle-dfdc. 
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12.3.3 Categorization Based on Input Types 


Another way to categorize existing DeepFake detection methods can be obtained 
by the type of the inputs. Most existing DeepFake detection methods are based on 
binary classification at the frame level, i.e., determining the likelihood of an individual 
frame as real or of DeepFake. Although simple and easy to implement, there are two 
issues related with the frame-based detection methods. First, the temporal consistency 
among frames are not explicitly considered, as (1) many DeepFake videos exhibit 
temporal artifacts and (ii) real or DeepFake frames tend to appear in continuous 
intervals. Second, it necessitates an extra step when video-level integrity score is 
needed: we have to aggregate the scores over individual frames to compute such a 
score using certain types of aggregation rules, common choices of which include 
the average, the maximum, or the average of top range. Temporal methods (e.g., 
Sabir et al. 2019; Amerini et al. 2019; Koopman et al. 2018), on the other hand, take 
the whole frame sequences as input, and use the temporal correlation between the 
frames as an intrinsic feature. Temporal methods often uses sequential models such 
as RNNs as basic model structure, and directly output the videos level prediction. 
The audio-visual detection methods also use the audio track as an input, and detect 
DeepFakes based on the asynchrony between the audios and frames. 


12.3.4 Categorization Based on Output Types 


Regardless of the underlying methodology or types of input, the majority of existing 
DeepFake detection methods formulate the problem as binary classification, which 
returns a label for each input video that signifies if the input is a real or a DeepFake. 
Often, the predicted labels are accompanied with a confidence score, a real value 
in [0, 1], which may be interpreted as the probability of the input to belong to one 
of the classes (i.e., real or DeepFake). A few methods extend this to a multi-class 
classification problem, where the labels also reflect different types of DeepFake 
generation models. There are also methods that solve the location problem (e.g., Li 
et al. 2019b; Huang et al. 2020), which further identifies the spatial area (in the form 
of bounding boxes or masked out region) and time interval of DeepFake treatments. 


12.3.5 The DeepFake-o-Meter Platform 


Unfortunately, the plethora of DeepFake detection algorithms were not taken full 
advantage of. On the one hand, differences in training datasets, hardware, and learn- 
ing architectures across research publications make rigorous comparisons of differ- 
ent detection algorithms challenging. On the other hand, the cumbersome process 
of downloading, configuring, and installing of individual detection algorithms deny 
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Fig.12.3 Detection results on DeepFake-o-meter of a state-of-the-art DeepFake detection method 
(Li and Lyu 2019) over a fake video on youtube.com. The lower integrity score (range in [0, 1]) 
suggests a video frame more likely to be generated using DeepFake algorithms 


the access of the state-of-the-art DeepFake detection methods to most users. To this 
end, we have developed DeepFake-o-meter http://zinc.cse.buffalo.edu/ubmdfi/ 
deepfake-o-meter/, an open platform for DeepFake detections. For developers of 
DeepFake detection algorithms, it provides an API architecture to wrap individual 
algorithms and run on a third-party remote server. For researchers, it is an eval- 
uation/benchmarking platform to compare multiple algorithms on the same input. 
For users, it provides a convenient portal to use multiple state-of-the-art detection 
algorithms. Currently, we have incorporated 11 state-of-the-art DeepFake image and 
video detection methods. A sample analysis result is shown in Fig. 12.3. 


12.3.6 Datasets 


The avallability of large-scale datasets of DeepFake videos is an enabling factor 
to the development of DeepFake detection methods. The first DeepFake dataset 
UADFYV (Yang et al. 2019) only has 49 DeepFake videos with visible artifacts when 
it was released in June 2018. Subsequently, more DeepFake datasets are proposed 
with increasing quantities and qualities. Several examples of synthesized DeepFake 
video frames are shown in Fig. 12.4. 


e UADFYV (Yang et al. 2019): This dataset contains 49 real videos downloaded from 
Youtube, which were used to create 49 DeepFake videos. 

e DeepfakeTIMIT (Korshunov and Marcel 2018): The original videos in this dataset 
is from the VidTIMIT database, from which a total of 640 DeepFake videos were 
generated. 

e FaceForensics++ (Rössler et al. 2019): FaceForensics++ is a forensics dataset 
consisting of 1,000 original video sequences that have been manipulated with 
four automated face manipulation methods: DeepFakes, Face2Face, FaceSwap 
and NeuralTextures. 
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Fig. 12.4 Example frames of DeepFake videos from the Celeb-DF dataset (Li et al. 2020). The 
left most column corresponds to frames from the original videos, and the other columns are frame 
from DeepFake videos swapped with synthesized faces 
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DFD: The Google/Jigsaw DeepFake detection dataset has 3,068 DeepFake videos 
generated based on 363 original videos of 28 consented individuals of various 
genders, ages, and ethnic groups. 

DFDC (Dolhansky et al. 2019): The Facebook DeepFake Detection Challenge 
(DFDC) Dataset is part of the DeepFake detection challenge, which has 4,113 
DeepFake videos created based on 1,131 original videos of 66 consented individ- 
uals of various genders, ages and ethnic groups. 

Celeb-DF (Li et al. 2020): The Celeb-DF dataset contains real and DeepFake 
synthesized videos having similar visual quality on par with those circulated online. 
It includes 590 original videos collected from YouTube with subjects of different 
ages, ethic groups and genders, and 5,639 corresponding DeepFake videos. 
DeeperForensics-1.0 (Jiang et al. 2020): DeeperForensics-1.0 consists of 10,000 
DeepFake videos generated by a new DNN-based face swapping framework. 
DFFD (Stehouwer et al. 2019): The DFFD dataset contains 3,000 videos of four 
types of digital manipulations as identity swap, expression, swap, attribute manip- 
ulation, and entire synthesized faces. 


12.3.7 Challenges 


Albeit impressive progress has been made in the performance of detection of Deep- 
Fake videos, there are several concerns over the current detection methods that sug- 
gest caution. 

Performance Evaluation. Currently, the problem of detecting DeepFake videos 
is commonly formulated, solved, and evaluated as a binary classification problem, 
where each video is categorized as real or a DeepFake. Such dichotomy is easy to 
set up in controlled experiments, where we develop and test DeepFake detection 
algorithms using videos that are either pristine or made with DeepFake generation 
algorithms. However, the picture is murkier when the detection method is deployed 
in real world. For instance, videos can be fabricated or manipulated in ways other 
than DeepFakes, so not being detected as a DeepFake video does not necessarily 
suggest the video is a real one. Also, a DeepFake video may be subject to other 
types of manipulations and a single label may not comprehensively reflect such. 
Furthermore, in a video with multiple subjects’ faces only one or a few are generated 
with DeepFake for a fraction of the frames. So the binary classification scheme needs 
to be extended to multi-class, multi-label, and local classification/detection to fully 
handle the complexities of real-world media forgeries. 

Explainability of Detection Results. Current DeepFake detection methods are 
mostly designed to perform batch analysis over a large collection videos. However, 
when the detection methods are used in the field by journalists or law enforcement, 
we usually need only to analyze a small number of videos. Numerical score corre- 
sponding to the likelihood of a video being generated using a synthesis algorithm is 
not as useful to the practitioners if it is not corroborated with proper reasoning of the 
score. In such scenarios, it is very typical to request a justification for the numerical 
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score for the analysis to be acceptable for publishing or used in court. However, many 
data-driven Deepfake detection methods, especially those based on the use of deep 
neural networks, usually lack explainability due to the black box nature of the DNN 
models. 

Generalization Across Datasets. As a DNN-based DeepFake detection method 
needs to be trained on a specific training dataset, a lack of generalization has been 
observed for DeepFake videos created using models not represented in the training 
dataset. This is different from overfitting, as the learned model may perform well 
on testing videos that are created with the same model but not used in training. The 
problem is caused by “domain shifting” when the trained detection method is applied 
to DeepFake videos generated using different generation models. A simple solution 
is to enlarge the training set to represent more diverse generation models, but a more 
flexible approach is needed to scale up to previously unseen models. 

Social Media Laundering. A large fraction of online videos are now spread through 
social networks, e.g., FaceBook, Instagram, and Twitter. To save network bandwidth 
and also to protect the users’ privacy, these videos are usually striped off meta- 
data, down-sized, and then heavy compressed before they are uploaded to the social 
platforms. These operations, commonly known as social media laundering, are detri- 
mental to recover traces of underlying manipulation, and at the same time increase 
the false positive detections, i.e., classifying a real video as a DeepFake. So far, most 
data-driven DeepFake detection methods that use signal-level features are much 
affected by social media laundering. A practical measure to improve the robustness 
of DeepFake detection methods to social media laundering is to actively incorporate 
simulations of such effects in training data, and also enhance evaluation datasets to 
include performance on social media laundered videos, both real and synthesized. 
Anti-forensics. With the increasing effectiveness of DeepFake detection methods, 
we also anticipate developments of corresponding anti-forensic measures, which take 
advantage of the vulnerabilities of current DeepFake detection methods to conceal 
revealing traces of DeepFake videos. The data-driven DNN-based DeepFake detec- 
tion methods are particularly susceptible to anti-forensic attacks due to the known 
vulnerability of general deep neural network classification models (see Principle 2 
of Sect. 12.3.1). Indeed, recent years have witnessed a rapid development of anti- 
forensic methods based on adversarial attacks to DNN models targeting DeepFake 
detectors (Huang et al. 2020; Carlini and Farid 2020; Gandhi and Jain 2020; Neekhara 
2020). Anti-forensic measures can also be developed in the other aspect, to disguise a 
real video as a DeepFake video by adding simulated signal level features used by cur- 
rent detection algorithms, a situation we term as fake DeepFake. Further, DeepFake 
detection methods must improve to handle such intentional and adversarial attacks. 
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12.4 Future Directions 


Besides continuing improving to solve the aforementioned limitations, we also envi- 
sion a few important directions of DeepFake detection methods that will receive 
more attention in the coming years. 

Other Forms of DeepFake Videos. Although face swapping is currently the most 
widely known form of DeepFake videos, it is by no means the most effective. In par- 
ticular, for the purpose of impersonating someone, face swapping DeepFake videos 
have several limitations. Psychological studies (Sinha et al. 2009) show that human 
face recognition largely relied on information gleaned from face shape and hairstyle. 
As such, to create convincing impersonating effect, the person whose face is to be 
replaced (the target) has to have similar face shape and hairstyle to the person whose 
face is used for swapping (the donor). Second, as the synthesized faces need to be 
spliced into the original video frame, the inconsistencies between the synthesized 
region and the rest of the original frame can be severe and difficult to conceal. In these 
respects, other forms of DeepFake videos, namely, head puppetry and lip-syncing, are 
more effective and thus should become the focus of subsequent research in DeepFake 
detection. Methods studying whole face synthesis or reenactment have experienced 
fast development in recent years. Although there have not been as many easy-to-use 
and free open-source software tools generating these types of DeepFake videos as 
for the face-swapping videos, the continuing sophistication of the generation algo- 
rithms will change the situation in the near future. Because the synthesized region 
is different from face swapping DeepFake videos (the whole face in the former and 
lip area in the latter), detection methods designed based on artifacts specific to face 
swapping are unlikely to be effective for these videos. Correspondingly, we should 
develop detection methods that are effective to these types of DeepFake videos. 
Audio DeepFakes. Al-based impersonation are not limited to imagery, recent 
Al-synthesized content-generation are leading to the creation of highly realistic 
audios (Ping et al.2017; Gu and Kang 2018; AlBadawy and Lyu 2020). Using synthe- 
sized audios of the impersonating target can significantly make the DeepFake videos 
more convincing and compounds its negative impact. As audio signals are 1D signals 
and have very different nature from images and videos, different methods need to 
be developed to specifically target such forgeries. This problem has drawn attention 
in the speech processing community recently with part of the most recent Global 
ASVspoofing Challenge (https://www.asvspoof.org/). Dedicated to AI-driven voice 
conversion detection, and a few dedicated methods for audio DeepFake detection, 
e.g., AlBadawy et al. (2019), have also shown up recently. In the coming years, we 
expect more developments in these areas, in particular, those can leverage features 
in both visual and audio features of the fake videos. 

Intent Inference. Even though the potential negative impacts of DeepFake videos 
are tremendous, in reality, the majority of DeepFake videos are not created not with 
a malicious intent. Many DeepFake videos currently circulated online are of a prank- 
some, humorous, or satirical nature. As such, it is important to expose the underlying 
intent of a DeepFake in the context of legal or journalistic investigation (Principle 3 
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in Sect. 12.3.1). Inferring intention may require more semantic and contextual under- 
standing of the content, few forensic methods are designed to answer this question, 
but this is certainly a direction that future forensic methods will focus on. 

Human Performance. Although the potential negative impacts of online DeepFake 
videos are widely recognized, currently there is a lack of formal and quantitative 
study of the perceptual and psychological factors underlying their deceptiveness. 
Interesting questions such as if there exist an uncanny valley’ for DeepFake videos, 
what is the just noticeable difference between high-quality DeepFake videos and real 
videos to human eyes, or what type/aspects of DeepFake videos are more effective in 
deceiving the viewers, have yet to be answered. To pursue these questions, it calls for 
close collaboration among researchers in digital media forensics and in perceptual 
and social psychology. There is no doubt that such studies are invaluable to research 
in detection techniques as well as a better understanding of the social impact that 
DeepFake videos can cause. 

Protection measures. However, given the speed and reach of the propagation of 
online media, even the currently best forensic techniques will largely operate in 
a postmortem fashion, applicable only after AI synthesized fake face images or 
videos emerge. We aim to develop proactive approaches to protect individuals from 
becoming the victims of such attacks, which complement to the forensic tools. 

One such method we have recently studied (Li et al. 2019c; Sun et al. 2020) is 
to add specially designed patterns known as the adversarial perturbations that are 
imperceptible to human eyes but can result in detection failures. The rationale is 
as follows. High-quality AI face synthesis models need large number of, typically 
in the range of thousands, sometimes even millions, training face images collected 
using automatic face detection methods, i.e., the face sets. Adversarial perturbations 
“pollute” a face set to have few actual faces and many non-faces with low or no utility 
as training data for AI face synthesis models. The proposed adversarial perturbation 
generation method can be implemented as a service of photo/video sharing platforms 
before a user’s personal images/videos are uploaded or as a standalone tool that the 
user can use, to process the images and videos before they are uploaded online. 


12.5 Conclusion and Outlook 


We predict that several future technological developments will further improve the 
visual quality and generation efficiency of the fake videos. Firstly, one critical dis- 
advantage of the current DeepFake generation methods are that they cannot produce 
good details such as skin and facial hairs. This is due to the loss of information in the 
encoding step of generation. However, this can be improved by incorporating GAN 
models (Goodfellow et al. 2014) which have demonstrated performance in recov- 


8 The uncanny valley in this context refers to the phenomenon whereby a DeepFake generated face 
bearing a near-identical resemblance to a human being arouses a sense of unease or revulsion in the 
viewers. 
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ering facial details in recent works (Karras et al. 2018b, 2019, 2020). Secondly, 
the synthesized videos can be more realistic if they are accompanied with realistic 
voices, which combines video and audio synthesis together in one tool. 

In the face of this, the overall running efficiency, detection accuracy, and more 
importantly, false positive rate, have to be improved for wide practical adoption. 
The detection methods also need to be more robust to real-life post-processing steps, 
social media laundering, and counter-forensic technologies. There is a perpetual 
competition of technology, know-hows, and skills between the forgery makers and 
digital media forensic researchers. The future will reckon the predictions we make 
in this work. 
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Chapter 13 A) 
Video Frame Deletion and Duplication get 


Chengjiang Long, Arslan Basharat, and Anthony Hoogs 


Videos can be manipulated in a number of different ways, including object addition 
or removal, deep fake videos, temporal removal or duplication of parts of the video, 
etc. In this chapter, we provide an overview of the previous work related to video 
frame deletion and duplication and dive into the details of two deep-learning-based 
approaches for detecting and localizing frame deletion (Chengjiang et al. 2017) and 
duplication (Chengjiang et al. 2019) manipulations. This should provide the reader a 
brief overview of the related research and details of a couple of deep-learning-based 
forensics methods to defend against temporal video manipulations. 


13.1 Introduction 


Digital video forgery (Sowmya and Chennamma 2015) is referred to as intentional 
modification of the digital video for fabrication. A common digital video forgery 
technique is temporal manipulation, which includes frame sequence manipulations 
such as dropping, insertion, reordering and looping. By altering only the temporal 
aspect of the video the manipulation is not detectable by single-image forensic tech- 
niques; therefore, there is need for digital forensics methods that perform temporal 
analysis of videos to detect such manipulations. 

In this chapter, we will first focus on the problem of video frame deletion detection 
in a given, possibly manipulated, video without the original video. As illustrated in 
Fig. 13.1, we define a frame drop to be a removal of any number of consecutive 
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Original video 


— Luni. 


Dropped frames 


Fig. 13.1 The illustration of frame dropping detection challenge. Assuming that there are three 
consecutive frame sequences (marked in red, green and blue, respectively) in an original video, the 
manipulated video is obtained after removing the green frame sequence. Our goal is to identify the 
location of the frame drop at the end of the red frame sequence and the beginning of the blue frame 
sequence 


frames within a video shot.! In our work (Chengjiang et al. 2017) to address this 
problem, we only consider videos with a single shot to avoid the confusion between 
frame drops and shot breaks. Single-shot videos are prevalent from various sources, 
like mobile phones, car dashboard cameras or body-worn cameras. 

To the best of our knowledge, only a small amount of recent work (Thakur 
et al. 2016) has explored automatically detecting dropped frames without a refer- 
ence video. In digital forgery detection, we cannot assume a reference video, unlike 
related techniques that detect frame drops for quality assurance. Wolf (2009) pro- 
posed a frame-by-frame motion energy cue defined based on the temporal infor- 
mation difference sequence for finding dropped/repeated frames, among which the 
changes are slight. Unlike Wolf’s work, we detect the locations where frames are 
dropped in a manipulated video without being compared with the original video. 
Recently, Thakur et al. (2016) proposed an SVM-based method to classify tampered 
or non-tampered videos. In this work, we explore the authentication (Valentina et al. 
2012; Wang and Farid 2007) of the scene or camera to determine if a video has one 
or more frame drops without a reference or original video. We expect such authen- 
tication is able to explore underlying spatio-temporal relationships across the video 
so that it is robust to digital-level attacks and conveys a consistency indicator across 
the frame sequences. 

We believe that we can still use similar assumption that consecutive frames are 
consistent with each other and the consistency will be destroyed if there exists tem- 
poral manipulation. To authenticate a video, two-frame techniques such as color 
histogram, motion energy (Stephen 2009) and optical flow (Chao et al. 2012; Wang 
et al. 2014) have been used. By only using two frames these techniques cannot gen- 
eralize to work on both videos with rapid scene changes (often from fast camera 


' A shot is a consecutive sequence of frames captured between the start and stop operations of a 
single video camera. 
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Original video 
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Fig. 13.2 An illustration of frame duplication manipulation in a video. Assume an original video 
has three sets of frames indicated here by red, green and blue rectangles. A manipulated video can 
be generated by inserting a second copy of the red set in the middle of the green and the blue sets. 
Our goal is to detect both instances of the red set as duplicated and also determine that the second 
instance is the one that’s forged 


motion) and videos with subtle scene changes such as static camera surveillance 
videos. 

In the past few years, deep learning algorithms have made significant break- 
throughs, especially in the image domain (Krizhevsky et al. 2012). The features com- 
puted by these algorithms have been used for image matching/classification (Zhang 
et al. 2014; Zhou et al. 2014). In this chapter, we evaluate approaches using these 
features for dropped frame detection using two to three frames. However, these 
image-based deep features still lack modeling the motion effectively. 

Inspired by Tran et al.’s C3D network (Tran et al. 2015), which is able to extract 
powerful spatio-temporal features for action recognition, we propose a C3D-based 
network for detecting frame drops, as illustrated in Fig. 13.3. As we can observe, 
there are three aspects to distinguish our C3D-based network approach (Chengjiang 
et al. 2017) from Tran et al.’s work. (1) Our task is to check whether there exist 
frames dropped between the 8th and the 9th frame, which makes the center part 
more informative than the two ends of the 16-frame video clips; (2) the output of the 
network has two branches, which correspond to “frame drop” and “no frame drop”, 
between the 8th and the 9th frame; (3) unlike most approaches, we use the output 
scores from the network as confidence score directly and define confidence score 
with a peak detection step and a scale term based on the output score curves; and (4) 
such a network is able to not only predict whether the video has frame dropping but 
also detect the exact location where the frame dropping occurs. 

To summarize, the contributions of our work (Chengjiang et al. 2017) are: 


e Proposed a 3D convolutional network for frame dropping detection, and the con- 
fidence score is defined with a peak detection step and a scale term based on the 
output score curves. It is able to identify whether there exists frame dropping and 
even determine the exact location of frame dropping without any information of 
the reference/original video. 

e For performance comparison, we also compared a series of baselines, includ- 
ing cue-based algorithms (color histogram, motion energy and optical flow) and 
learning-based algorithms (an SVM algorithm and convolutional neural networks 
(CNNs) using two or three frames as input). 
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e The experimental results on both the Yahoo Flickr Creative Commons 100 Million 
(YFCC100m) dataset and the Nimble Challenge 2017 dataset clearly demonstrate 
the efficacy of the proposed C3D-based network. 


An increasingly large volume of digital video content is becoming available in our 
daily lives through the internet due to the rapid growth of increasingly sophisticated, 
mobile and low-cost video recorders. These videos are often edited and altered for 
various purposes using image and video editing tools that have become more readily 
available. Manipulations or forgeries can be done for nefarious purposes to either 
hide or duplicate an event or content in the original video. Frame duplication refers to 
a video manipulation where a copy of a sequence of frames is inserted into the same 
video either replacing previous frames or as additional frames. Figure 13.2 provides 
an example of frame duplication where in the manipulated video the red frame 
sequence from the original video is inserted between the green and the blue frame 
sequences. As a real-world example, frame duplication forgery could be done to hide 
an individual leaving a building in a surveillance video. If such a manipulated video 
was part of a criminal investigation, without effective forensics tools the investigators 
could be misled. 

Videos can also be manipulated by duplicating a sequence of consecutive frames 
with the goal of concealing or imitating specific content in the same video. In 
this chapter, we also describe a coarse-to-fine framework based on deep convo- 
lutional neural networks to automatically detect and localize such frame duplica- 
tion (Chengjiang et al. 2019). First, an I3D network finds coarse-level matches 
between candidate duplicated frame sequences and the corresponding selected origi- 
nal frame sequences. Then a Siamese network based on ResNet architecture identifies 
fine-level correspondences between an individual duplicated frame and the corre- 
sponding selected frame. We also propose a robust statistical approach to compute a 
video-level score indicating the likelihood of manipulation or forgery. Additionally, 
for providing manipulation localization information we develop an inconsistency 
detector based on the I3D network to distinguish the duplicated frames from the 
selected original frames. Quantified evaluation on two challenging video forgery 
datasets clearly demonstrates that this approach performs significantly better than 
four state-of-the-art methods. 

It is very important to develop robust video forensic techniques, to catch videos 
with increasing sophisticated forgeries. Video forensics techniques (Milani et al. 
2012; Wang and Farid 2007) aim to extract and exploit features from videos that can 
distinguish forgeries from original, authentic videos. Like other areas in information 
security, the sophistication of attacks and forgeries continue to increase for images 
and videos, requiring a continued improvement in forensic techniques. Robust detec- 
tion and localization of duplicated parts of a video can be a very useful forensic tool 
for those tasked with authenticating large volumes of video content. 

In recent years, multiple digital video forgery detection approaches have been 
employed to solve this challenging problem. Wang and Farid (2007) proposed a frame 
duplication detection algorithm which takes the correlation coefficient as a measure of 
similarity. However, such an algorithm results in a heavy computational load due to a 
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large number of correlation calculations. Lin et al. (2012) proposed to use histogram 
difference (HD) instead of correlation coefficients as the detection features. The 
drawback is that the HD features do not show strong robustness against common 
video operations or attacks. Hu et al. (2012) propose to detect duplicated frames 
using video sub-sequence fingerprints extracted from the DCT coefficients. Yang et 
al. (2016) propose an effective similarity-analysis-based method that is implemented 
in two stages, where the features are obtained via SVD. Ulutas et al. propose to use 
a BoW model (Ulutas et al. 2018) and binary features (Ulutas et al. 2017) for frame 
duplication detection. Although deep learning solutions, especially those based on 
convolution neural networks, have demonstrated promising performance in solving 
many challenging vision problems such as large-scale image recognition (Kaiming 
et al. 2016; Stock and Cisse 2018), object detection (Shaoqing et al. 2015; Yuhua 
et al. 2018; Tang et al. 2018) and visual captioning (Venugopalan et al. 2015; Aneja 
et al. 2018; Huanyu et al. 2018), no deep learning solutions were developed for this 
specific task at the time, which motivated us to fill this gap. 

In Chengjiang et al. (2019) we describe a coarse-to-fine deep learning framework, 
called C2F-DCNN, for frame duplication detection and localization in forged videos. 
As illustrated in Fig. 13.4, we first utilize an IBD network (Carreira and Zisserman 
2017) to obtain the candidate duplicate sequences at a coarse level; this helps narrow 
the search faster through longer videos. Next, at a finer level, we apply a Siamese 
network composed of two ResNet networks (Kaiming et al. 2016) to further confirm 
duplication at the frame level to obtain accurate corresponding pairs of duplicated 
and selected original frames. Finally, the duplicated frame range can be distinguished 
from the corresponding selected original frame range by our inconsistency detector 
that is designed as an I3D network with 16-frames as an input video clip. 

Unlike other methods, we consider the consistency between two consecutive 
frames from a 16-frame video clip in which these two consecutive frames are at 
the center, i.e., 8th and 9th frame. This is aimed at capturing the temporal context 
for matching a range of frames for duplication. Inspired by Long et al. (2017), we 
design an inconsistency detector based on the I3D network to cover three categories, 
i.e., “none”, “frame drop” and “shot break”, which represent that between the 8th 
and 9th frame there are no manipulations, frames removal within one shot, and a shot 
boundary transition, respectively. Therefore, we are able to use output scores from the 
learned I3D network to formulate a confidence score of inconsistency between any 
two consecutive frames to distinguish the duplicated frame range from the selected 
original frame range, even in videos with multiple shots. 

We also proposed a heuristic strategy to produce a video-level frame duplication 
likelihood score. This is built upon the measures like the number of possible frames 
duplicated, the minimum distance between duplicated frames and selected frames, 
and the temporal gap between the duplicated frames and the selected original frames. 

To summarize, the contributions of this approach (Chengjiang et al. 2019) are as 
follows: 


e A novel coarse-to-fine deep learning framework for frame duplication detection 
and localization in forged videos. This framework features fine-tuned I3D networks 
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and the ResNet Siamese network, providing a robust yet efficient approach to 
process large volumes of video data. 

Designed an inconsistency detector based on a fine-tuned I3D network that covers 
three categories to distinguish duplicated frame range from the selected original 
frame range. 

A heuristic formulation for video-level detection score, which leads to significant 
improvement in detection benchmark performance. 

Evaluated performance on two video forgery datasets and the experimental results 
strongly demonstrate the effectiveness of the proposed method. 


13.2 Related Work 


13.2.1 Frame Deletion Detection 


The most related prior work can be roughly split into two categories: video inter- 
frame forgery identification and shot boundary detection. 

Video inter-frame forgery identification. Video inter-frame forgery involves 
frame insertion and frame deletion. Wang et al. proposed an SVM method (Wang 
et al. 2014) based on the assumption that the optical flows are consistent in an 
original video, while in forgeries the consistency will be destroyed. Chao’s optical 
flow method (Chao et al. 2012) provides different detection schemes for inter-frame 
forgery based on the observation that the subtle difference between frame insertion 
and deletion. Besides optic flow, Wang et al. (2014) also extracted the consistency of 
correlation coefficients of gray values as distinguishing features to classify original 
videos and forgeries. Zheng et al. (2014) proposed a novel feature called block-wise 
brightness variance descriptor (BB VD) for fast detection of video inter-frame forgery. 
Different from this inter-frame forgery identification, our proposed C3D-based net- 
work (Chengjiang et al. 2017) is able to explore the powerful spatio-temporal rela- 
tionships as the authentication of the scene or camera in a video for frame dropping 
detection. 

Shot Boundary Detection. There is a large amount of work to solve the shot 
boundary detection problem (Smeaton et al. 2010). The task of shot boundary detec- 
tion (Smeaton et al. 2010) is to detect the boundaries to separate multiple shots 
within a video. The TREC video retrieval evaluation (TRECVID) is an important 
benchmark dataset for automatic shot boundary detection challenge. And different 
research groups from across the world have worked to determine the best approaches 
to shot boundary detection using a common dataset and common scoring metrics. 
Instead of detecting where two shots are concatenated, we are focused on detecting 
a frame drop within a single shot. 
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16-frame clip 


Check whether there exist dropped frames 
between the 8-th and 9-th frame? 


Training stage 


Testing stage 


Output scores Confidence scores 


Fig. 13.3 The pipeline of the C3D-based method. At the training stage, the C3D-based network 
takes 16-frame video clips extracted from the video dataset as input, and produces two outputs, i.e., 
“frame drop” (indicated with “+”) or “no frame drop” (indicated with “~”). At the testing stage, 
we decompose a testing video into a sequence of continuous 16-frame clips and then fit them into 
the learned C3D-based network to obtain the output scores. Based on the score curves, we use a 
peak detection step and introduce a scale term to define the confidence scores to detect/identify 
whether there exist dropped frames for per frame clip or per video. The network model consisted 
of 66 million parameters with 3 x 3 x 3 filter size at all convolutional layers 


13.22 Frame Duplication Detection 


The research related to frame duplication can be broadly divided into inter-frame 
forgery, copy-move forgery and convolutional neural networks. 

Inter-frame forgery refers to frame deletion and frame duplication. For features 
used for inter-frame forgery, either spatially or temporally, keypoints are extracted 
from nearby patches recognized over distinctive scales. Keypoint-based methodolo- 
gies can be further subdivided into direction-based (Douze et al. 2008; Le et al. 2010), 
keyframe-based coordinating (Law-To et al. 2006) and visual-words-based (Sowmya 
and Chennamma 2015). In particular, keyframe-based feature has been shown to per- 
form well for close video picture/feature identification (Law-To et al. 2006). 

In addition to keypoint-based features, Wu et al. (2014) propose a velocity field 
consistency-based approach to detect inter-frame forgery. This method is able to 
distinguish the forgery types, identify the tampered video and locate the manipulated 
positions in forged videos as well. Wang et al. (2014) propose to make full use of 
the consistency of the correlation coefficients of gray values to classify original 
videos and inter-frame forgeries. They also propose an optical flow method (Wang 
et al. 2014) based on the assumption that the optical flows are consistent in an 
original video, while in forgeries the consistency will be destroyed. The optical flow 
is extracted as a distinguishing feature to identify inter-frame forgeries through a 
support vector machine (SVM) classifier to recognize frame insertion and frame 
deletion forgeries. 


Sequence-to-sequence Frame-to-frame = Selected frames 
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Fig. 13.4 The C2F-DCNN framework for frame duplication detection and localization. Given a 
testing video, we first run the I3D network (Carreira and Zisserman 2017) to extract deep spatial- 
temporal features and build the coarse sequence-to-sequence distance to determine the possible 
frame sequences that are likely to have frame duplication. For the likely duplicated sequences, a 
ResNet-based Siamese network further confirms a frame duplication at the frame level. For the 
videos with duplication detected, temporal localization is determined with an 13D-based inconsis- 
tency detector to distinguish the duplicated frames from the selected frames 


Huang et al. (2018) proposed a fusion of audio forensics detection methods for 
video inter-frame forgery. Zhao et al. (2018) developed a similarity analysis-based 
method to detect inter-frame forgery in a video shot. In this method, the HSV color 
histogram is calculated to detect and locate tampered frames in the shot, and then 
the SURF feature extraction and FLANN (Fast Library for Approximate Nearest 
Neighbors) matching are used for further confirmation. 

Copy-move forgery is created by copying and pasting content within the same 
frame, and potentially post-processing it (Christlein et al. 2012; D’ Amiano et al. 
2019). Wang et al. (2009) propose a dimensionality reduction approach through prin- 
cipal component analysis (PCA) on the different pieces. Mohamadian et al. (2013) 
develop a singular value decomposition (SVD) based method in which the image 
is isolated into numerous little covering squares and after that SVD is requested to 
remove the copied frames. Yang et al. (2018) proposed a copy-move forgery detec- 
tion based on a modified SIFT-based detector. Wang et al. (2018) presented a novel 
block-based robust copy-move forgery detection approach using invariant quater- 
nion exponent moments. D‘Amiano et al. (2019) proposed a dense-field method 
with a video-oriented version of PatchMatch for the detection and localization of 
copy-move video forgeries. 

Convolutional neural networks (CNNs) have been demonstrated to learn rich, 
robust and powerful features for large-scale video classification (Karpathy et al. 
2014). Various 3D CNN architectures (Tran et al. 2015; Carreira and Zisserman 
2017; Hara et al. 2018; Xie et al. 2018) have been proposed to explore spatio-temporal 
contextual relations between consecutive frames for representation learning. Unlike 
the existing methods for inter-frame forgery and copy-move forgery which mainly 
use hand-crafted features or bag-of-words, we take advantage of convolutional neural 
networks to extract spatial and temporal features for frame duplication detection and 
localization. 
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13.3 Frame Deletion Detection 


There is limited work exploring frame deletion or dropping detection problem with- 
out reference or original video. Therefore, we first introduce a series of baselines, 
including cue-based and learning-based methods, and then introduce our proposed 
C3D-based CNN. 


13.3.1 Baseline Approaches 


We studied three different cue-based baseline algorithms from the literature, i.e., (1) 
color histogram, (2) optical flow (Wang et al. 2014; Chao et al. 2012) and (3) motion 
energy (Stephen 2009) as follows: 


Color histogram. We calculate the histograms on all R, G and B three chan- 
nels. Whether there are frames dropped between the two consecutive frames is 
detected by thresholding the score calculated by the L2 distances based on the 
color histograms of the two adjacent frames. 

Optical flow. We calculate the optical flow (Wang et al. 2014; Chao et al. 2012) 
from the two adjacent frames by the Lucas-Kanade method. Whether there exist 
frames dropped between the current frame and the next frame is detected by 
thresholding the L2 distance between the average moving direction between the 
previous frame and the current frame, and the average moving direction between 
the current frame and the next frame. 

Motion energy. Motion energy is the temporal information (TI) difference 
sequence (Stephen 2009), i.e., the difference of Y channel in the YCrCb color 
space. Whether there exist frames dropped between the current frame and the next 
frame is detected by thresholding the motion energy between the current frame 
and the next frame. 


Note that each algorithm mentioned above compares two consecutive frames and 
estimates whether there are missing frames between them. We also developed four 
learning-based baseline algorithms as follows: 


e SVM. We train an SVM model to predict whether there are frames dropped 
between two adjacent frames. The feature vector is the concatenation of the abso- 
lute difference of color histograms and the two-dimensional absolute difference of 
the optical flow directions. The optical flow dimensionality is much smaller than 
the color histogram, and therefore we give it a higher weight. 

Pairwise Siamese Network. We train a Siamese CNN that determines if the two 
input frames are consecutive or if there is frame dropping between them. Each 
CNN consists of two convolutional layers and three fully connected layers. The 
loss used is contrastive loss. 

Triplet Siamese Network. We extend the pairwise Siamese network to use three 
consecutive frames. Unlike the pairwise Siamese network, the triplet Siamese 
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Table 13.1 A list of related algorithms for temporal video manipulation detection. The first three 
algorithms are cue-based without any training work. The rest are learned-based algorithms, including 
the traditional SVM, the popular CNNs and the method we proposed in Chengjiang et al. (2019) 


Method Brief description Learning? 

Color histogram RGB 3 channel histograms + | No 
L2 distance 

Optical flow The optic flow (Wang et al. No 
2014; Chao et al. 2012) with 
Lucas-Kanade method + L2 
distance 

Motion energy Based on temporal information | No 
difference (Stephen 2009) 
sequence 

SVM 770-D feature vector Yes 
(3 x 256-D RGB histogram + 
2-D optic flow) 

Pairwise Siamese Network Siamese network architecture | Yes 
(2 conv layers + 3 fc layers + 
contrastive loss) 

Triplet Siamese Network Siamese network architecture | Yes 
(Alexnet-variant + 
Euclidean&contrastive loss) 

Alexnet (Krizhevsky et al. Alexnet-variant network Yes 

2012) Network architecture 

C3D-based C3D-variant network Yes 


Network (Chengjiang et al. 


2019) 


architecture + confidence score 


network consisted of three Alexnets (Krizhevsky et al. 2012) merging their output 
with Euclidean loss between the previous frame and the current frame, and with 
contrastive loss between the current frame and the next frame. 
e Alexnet-variant Network. The input frames are converted to gray-scale and put 


into the RGB channels. 


To facilitate the comparison of the competing algorithms, we summarize the above 


descriptions in Table 13.1. 


13.3.2 C3D Network for Frame Deletion Detection 


The baseline CNN algorithms we investigated lacked a strong temporal feature suit- 
able to capture the signature of frame drops. These algorithms only used features from 
two to three frames that were computed independently. C3D network was originally 
designed for action recognition, however, we found that spatio-temporal signature 
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produced by the 3D convolution is also very effective in capturing the frame drop 
signatures. 

The pipeline of our proposed method is as shown in Fig. 13.3. As we can observe, 
there are three modifications from the original C3D network. First, the C3D network 
takes clips of 16 frames, therefore we check the center of the clip (between frames 8 
and 9) for frame drops to give equal context on both sides of the drop. This is done by 
formulating our training data so that frame drops only occur in the center. Secondly, 
we have a binary output associated with “frames dropped” and “no frames dropped” 
between the 8th and 9th frame. Lastly, we further refine the per-frame network output 
scores into a confidence score using peak detection and temporal scaling to further 
suppress the noisy detections. With the refined confidence scores we are not only 
able to identify whether the video has frame drops but also localize them by applying 
the network to the video in a sliding window fashion. 


13.3.2.1 Data Preparation 


To obtain the training data, we used 2,394 iPhone 4 consumer videos from the 
World Dataset made available on the DARPA Media Forensics (MediFor) program 
for research. We pruned the videos such that all videos were of length 1-3 min. 
We get ended up with 314 videos, of which we randomly selected 264 videos for 
training, and the rest 50 videos for validation. We developed a tool that randomly 
drops fixed-length frame sequences from videos. It picks a random number of frame 
drops and random frame offsets in the video for each removal. The frame drops do 
not overlap, and it forces 20 frames to be kept around each drop. In our experiments, 
we manipulate each video many different times to create more data. We vary the 
fixed frame drop length to see how it affects detection we used 0.5 s, 1 s, 2 s, 5 s 
and 10 s as five different frame drop durations. We used the videos with these drop 
durations to train a general C3D-based network for frame drop detection. 


13.3.2.2 Training 


We use momentum jz = 0.9, y = 0.0001 and set power to be 0.075. We start training 
at a base learning rate of œ = 0.001 and the “inv” as the learning rate policy. We 
set the batch size to be 15 and use the 206000th iteration as the learned model for 
testing, which achieves about 98.2% validation accuracy. 


13.3.2.3 Testing 


The proposed C3D-based network is able to identify the temporal removal manipu- 
lation due to dropped frames in a video and also localize one or more frame drops 
within the video. We observe that some videos captured by moving digital cam- 
eras may have multiple changes due to quickly camera motion, zooming in/out, etc., 
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which can be deceiving to the C3D-based network and can result in false frame drop- 
ping detections. In order to reduce such false alarms and increase the generalization 
ability of our proposed network, we propose an approach to refine the raw network 
output scores to the confidence scores using peak detection and introduction of a 
scale term based on the output score variation, i.e., 


1. We first detect the peaks on the output score curve obtained from the proposed 
C3D-based network per video. Among all the peaks, we only pick the top 2% 
peaks and ignore the rest of the peaks. Then we shift the time window to check 
the number of peaks (denoted as n „) appearing in the time window with ith frame 
as the center (denoted as W(i)). If the number is more than one, i.e., other peaks 
in the neighborhood, the output score f (i) will be penalized. The value will be 
penalized more if there are a lot of high peaks detected. The intuition behind is 
that we want to reduce the false alarms when there are multiple peaks occurring 
close just because the camera is moving or even zooming in/out. 

2. We also introduce a scale term A (i) defined as the difference of the median score 
and the minimum score within the time window W (i) to control the influence of 
the camera motion. 


Based on the above statement, we can obtain the confidence score for the ith frame 


as 
fD — XAG) whenn, < 2 


Seong (Ù) = 19 —AAC) otherwise’ (13.1) 

where i ü 
W(i) = {i — —,...,i +}. 13.2 
(i) = {i > i+ 5} ( ) 


Note that A in Eq. 13.1 is a parameter to control how much the scale term affects the 
confidence score, and w in Eq. 13.2 indicates the width of the time window. 

For testing per frame, say ith frame, we first form a 16-frame video clip and set 
the ith frame to be the 8th frame in the video clip, and then we can get the output 
score foong (i). If feont i) > Threshold, then we predict there are dropped frames 
between the ith frame and the (i + 1)th frame. For testing per video, we take it 
as a binary classification and confidence measure per video. To simplify, we use a 
simple confidence measure, i.e., max feon (i) across all frames. If max feons (i) > 


Threshold, then there are temporal removal within the video. Otherwise, the video 
is predicted without any temporal removal. The results reported in this work are 
without any Threshold as we are reporting the ROC curves. 


13.3.3 Experimental Result 


We conducted the experiments on a Linux machine with Intel(R) Xeon(R) CPU E5- 
2687 0 @ 3.10 GHz, 32 GB system memory and graphical card NVIDIA GTX 1080 
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Fig. 13.5 Performance comparison on the YFCC100m dataset against seven baseline approaches, 
using per-frame ROCs for five different drop durations (a—e), and (f) is frame-level ROC for all the 
five drop durations combined 


(Pascal). We report our results as the ROC curves based on the output score feong (i) 
and accuracy as metrics. We present the ROC curves with false positive rate as well 
as false alarm rate per minute to provide and demonstrate the level of usefulness for 
a user that might have to adjudicate each detection reported by the algorithm. We 
present the ROC curves for both per-frame analysis where the ground truth data is 
available and per-video analysis otherwise. 

To demonstrate the effectiveness of the proposed approach, we ran experiments 
on the YFCC100m dataset” and the Nimble Challenge 2017 (Development 2 Beta 
1) dataset. 


13.3.3.1 YFCC100m Dataset 


We download 53 videos tagged with iPhone from Yahoo Flickr Creative Commons 
100 Million (YFCC100m) dataset and manually verified that they are single-shot 
videos. To create ground truth we used our automatic randomized frame dropping 
tool to generate the manipulated videos. For each video we generated manipulated 


2 YFCC100m dataset: http://www.yfcc100m.org. 


3 Nimble Challenge 2017 dataset: https://www.nist.gov/itl/iad/mig/nimble-challenge-2017- 
evaluation. 
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Table 13.2 The detailed results of our proposed C3D-based network. #pos and #neg are the number 
instances for the positive and the negative testing 16-frame video clips, respectively. The Acc pos 
and AcCyeg are the corresponding accuracy. Acc is the total accuracy. All the accuracies use the 
unit % 


Duration (s) 

0.5 2816:416633 

1 2333:390019 98.18 
2 2816:416633 98.12 
5 1225:282355 98.18 
10 770:239210 98.13 


videos with frame drops of 0.5, 1, 2, 5 or 10s intervals at random locations. For each 
video and each drop duration, we randomly generate 10 manipulated videos. In this 
way we collect 53 x 5 x 10 = 2650 manipulated videos as testing dataset. 

For each drop duration, we run all the competing algorithms in Table 13.1 on the 
530 videos with the parameter setting w = 16, À = 0.22. The experimental results 
are summarized in the ROC curves for all these five different drop durations in 
Fig. 13.5. 

One can note that (1) the traditional SVM outperforms the three simple cue-based 
algorithms; (2) the four convolution neural networks algorithms perform much better 
than the traditional SVM and all the cue-based algorithms; (3) among all the CNN- 
based networks, both the triplet Siamese network and the Alexnet-variant network 
perform similar and better than the pairwise Siamese network; and (4) our proposed 
C3D-based network performs the best. This provides some empirical support to the 
hypothesis that the proposed C3D-based method is able to take advantage of the 
temporal and spatial correlations, while the other CNN-based networks only explore 
the spatial information in the individual frames. 

To better understand the C3D-based network, we provide more experimental 
details in Table 13.2. With the drop duration increase, both the number of positive 
and negative testing instances decrease and the positive accuracy keeps increasing. 
As one might expect, the shorter the frame drop duration, the more difficult it is to 
detect. 

We also merge the results of the C3D-based network with five different drop 
durations in Fig. 13.5 together to plot a unified ROC curve. For comparison, we also 
plot another ROC curve that uses the output scores to detect whether there exist frame 
drops within a testing video. As we can see in Fig. 13.5(f), using output score from the 
C3D-based network, we can still achieve very good performance to 0.9983854 AUC. 
This observation can be explained by the fact that the raw phone videos from the 
YFCC100m dataset have less quick motion, no zooming in/out occurring and even 
no video manipulations. Also, the manipulated videos are generated in the same way 
as the generation of training manipulated videos with the same five drop durations. 
Since there are no overlaps on the video contents between training videos and testing 
videos, such a good performance demonstrates the power and the generalization 
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(b) A true negative video clip. 
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Fig. 13.6 The visualization of two successful examples (one true positive and the other one is true 
negative) and two failure examples (one false positive and the other one is false negative) from the 
YFCC100m dataset. The red dashed line indicates the location between the 8th frame and the 9th 
frame where we test for a frame drop. The red arrows point to the frame on the confidence score 
plots 


ability of the trained network. Although using output score directly achieves a very 
good AUC, using the confidence score defined in Eq. 13.1 can still improve the AUC 
from 0.9983854 to 0.9992465. This demonstrates the effectiveness of our confidence 
score defined with such a peak detection step and a scale term. 

We visualize both success and failure cases in our proposed C3D-based network, 
as shown in Fig. 13.6. Looking at the successful cases in Fig. 13.6(a), “frame drops” 
is identified correctly in the 16-frame video clip because a man stands at one side in 
the 8th frame and move to another side suddenly in the 9th frame, and the video clip 
in Fig. 13.6(b) is predicted as “no frame drops” correctly since a child follows his 
father in all 16 frames and the 8th frame and the 9th frame are consistent with each 
other. 

Regarding the failures cases, as shown in Fig. 13.6(c), there is no frame drop but it 
is still identified as “frame drop” between the 8th frame and the 9th frame due to the 
camera shakes during the video capture of such a street scene. Also, “frame drop” 
in the top clip cannot be detected correctly between the 8th frame and the 9th frame 
in the video clip, as shown in Fig. 13.6(d), since the scene inside the bus has almost 
no visible changes between these two frames. 
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Fig. 13.7 The ROC curve of our proposed C3D-based network on the Nimble Challenge 2017 
dataset 


Note that our training stage is carried out off-line. Here we only offer the runtime 
for the testing stage under our experimental environment. For each testing video clip 
with a 16-frame length, it takes about 2 s. For a one-minute short video with 30 FPS, 
it requires about 50 min to complete the testing throughout all the frame sequence. 


13.3.3.2 Nimble Challenge 2017 Dataset 


In order to check whether our proposed C3D-based network is able to identify a 
testing video with unknown arbitrary drop duration, we also conducted experiments 
on the Nimble Challenge 2017 dataset, specifically the NC2017-Dev2Betal version, 
in which there are 209 probe videos with various video manipulations. Among these 
videos, there are six videos manipulated with “TemporalRemove’”, which is regarded 
as “frame dropping”. Therefore, we run our proposed C3D-based network as a binary 
classifier to classify all these 209 videos into two groups, i.e., “frame dropping” and 
“no frame dropping”, at the video level. In this experiment, the parameters are set as 
w = 500, A = 1.25. 

We first plot the output scores from the C3D-based network and the confidence 
score of each of the six videos is labeled with “TemporalRemove” in Fig. 13.9. It 
is clear that the video named “d3c6bf5£224070f1df74a63c232e360b.mp4” has the 
lowest confidence score smaller than zero. 

To explain such a case, we further check the content of the video, as shown in 
Fig. 13.8. As we can observe, this video is even really hard for us to identify it as 
“TemporalRemoval” since it is taken by a static camera and only the lady’s mouth 
and head are taking very slight changes across the whole video from the beginning 
to the end. As we trained purely on iPhone videos, our training network was biased 
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Fig. 13.8 The entire frame sequence of the 34-second video 
“d3c6bf5£224070f1d£74a63c232e360b.mp4”, which has 1047 frames and was captured by a 
static camera. We observe that only the lady’s mouth and head are taking very slight change across 
the video from the beginning to the end 
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Fig. 13.9 The illustration of output scores from the C3D-based network and their confidence scores 
for six videos labeled with “TemporalRemove” from the Nimble Challenge 2017 dataset. The blue 
curve is the output score, the red “+” marks the detected peaks, and the red confidence score is used 
to determine whether the video can be predicted as a video with “frame drops” 


toward videos with camera motion. With a larger dataset of static camera videos, we 
can train different networks for static and dynamic cameras to address this problem. 

We plot the ROC curve in Fig. 13.7. As we can see, the AUC of the C3D-based 
network with confidence scores is high to 0.96, while the AUC of the C3D-based 
network with the output scores directly is only 0.86. The insight behind such a 
significant improvement is that there are testing videos with camera quick-moving, 
zooming in and out, as well as other types of video manipulations, and our confidence 
scores defined with the peak detection step and the scale term to penalize multiple 
peaks occurring too close and large scales is able to significantly reduces the false 
alarms. Such a significant improvement by 0.11 AUC strongly demonstrates the 
effectiveness of our proposed method. 
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13.4 Frame Duplication Detection 


As shown in Fig. 13.4, given a probe video, our proposed C2F-DCNN framework is 
designed to detect and localize frame duplication manipulation. An I3D network is 
used to produce a sequence-to-sequence matrix and determine the candidate frame 
sequences at the coarse-search stage. A Siamese network is then applied for a fine- 
level search to verify whether frame duplications exist. After this, an inconsistency 
detector is applied to further distinguish duplicated frames from selected frames. All 
of these steps are described below in detail. 


13.4.1 Coarse-Level Search for Duplicated Frame Sequences 


In order to efficiently narrow the search space, we start by finding possible duplicate 
sets of frames throughout the video using a robust CNN representation. We split a 
video into overlapping frame sequences, where each sequence has 64 frames and the 
number of overlapped frames is 16. We choose I3D Network (Carreira and Zisserman 
2017), instead of using C3D network (Tran et al. 2015) due to these reasons: (1) It 
inflates 2D ConvNets into 3D and makes filters from typically N x N square to 
NxNxN cubic; (2) it bootstraps 3D filters from 2D filters to bootstrap parameters 
from the pre-trained ImageNet models; and (3) it paces receptive field growth in 
space, time and network depth. 

In this work, we apply the pre-trained off-the-shell I3D network to extract the 
1024-dimensional feature vector for k = 64 frame sequences since the input for the 
standard I3D network is 64 rgb-data and 64 flow-data. We observed that a lot of 
time was being spent on the pre-processing. To reduce the testing runtime, we only 
compute the first k rgb-data and k flow-data items. For the subsequent frame sequence, 
we can copy (k — 1) rgb-data and (k — 1) flow-data from the previous video clip, 
and only calculate the last rgb-data and flow-data. This significantly improved the 
testing efficiency. 

Based on the sequence features, we calculate the sequence-to-sequence distance 
matrix over the whole video using L2 distance. If the distance is smaller than the 
threshold T;, then this indicates that these two frame sequences are likely duplicated 
and we take them as two candidate frame sequences for further confirmation during 
the next fine-level search. 


13.42 Fine-Level Search for Duplicated Frames 


For the candidate frame sequences, detected by the previous stage described in 
Sect. 13.4.1, we evaluate the distance between all pairs of frames across the two 
sequences, i.e., a duplicated frame and the corresponding selected original frame. 
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Fig. 13.10 A sample distance matrix based on the frame-to-frame distances computed by the 
Siamese network between a pair of frame sequences. The symbols shown on the line segment with 
low distance are used to compute the video-level confidence score for frame duplication detection 


For this purpose we propose a Siamese neural network architecture, which learns 
to differentiate between two frames in the provided pair. It consists of two identical 
networks by sharing exactly the same parameters, each taking one of the two input 
frames. A contrastive loss function is applied to the last layers to calculate the dis- 
tance between the pair. In principle, we can choose any neural network to extract 
features for each frame. 

In this work, we choose the ResNet network (Kaiming et al. 2016) with 152 
layers given its demonstrated robustness. We connect two ResNets in the Siamese 
architecture with a contrastive loss function, and each loss value associated with the 
distance between a pair of frames is formulated into the frame-to-frame distance 
matrix, in which the distance is normalized to the range [0, 1]. A distance smaller 
than the threshold T, indicates that these two frames are likely duplicated. For videos 
that have multiple consecutive frames duplicated we expect to see a line with low 
values parallel to the diagonal in the visualization of the distance matrix, as plotted 
in Fig. 13.10. 

It is worth mentioning that we provide both frame-level and video-level scores to 
evaluate the likelihood of frame duplication. For the frame-level score, we can use the 
value in the frame-to-frame distance directly. For the video-level score, we propose a 
heuristic strategy to formulate the confidence value. We first find the minimal value of 
distance dmin = d (imin, Jmin) Where imin, Jmin = argmino-; < Jen d(i, j) is the frame- 
to-frame distance matrix. Then a search is performed in two directions to find the 
number of consecutive duplicated frames: 


kı = argmax |d (imin — k, Jmin — k) — dmin| < € (13.3) 
k:k<imin 
and 
k2 = argmax |d(imin + k, jmin + k) — dmin| < € (13.4) 
k:k<n— jmin 


where e = 0.01 and the length of the interval with duplicated frames can be defined 
as 
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[=k + k; + 1. (13.5) 


Finally, we can formulate the video-level confidence score as follows: 


dmin 


lx (Jmin = imin) 


Fyideo = (13.6) 


The intuition here is that a more likely frame duplication is indicated by a smaller 
value of din, a longer interval of duplicated frames and a larger temporal gap between 
the selected original frames and the duplicated frames. 


13.4.3 Inconsistency Detector for Duplication Localization 


We observe that the duplicated frames inserted into the source video usually yield 
artifacts due to temporal inconsistency at both the beginning frames and the end 
frames in a manipulated video. To automatically distinguish the duplicated frames 
from selected frames, we make use of both spatial and temporal information by 
training an inconsistency detector to locate this temporal discrepancy. For this pur- 
pose, we build upon our work discussed above, Long et al. (2017), which proposed a 
C3D-based network for frame-drop detection and only works for single-shot videos. 
Instead of using only one RGB stream data as input, we replace the C3D network 
with an I3D network to also incorporate the optical flow data stream. It is also worth 
mentioning that unlike the I3D network used in Sect. 13.4.1, input to the I3D network 
here is a 16-frame temporal interval, every frame in a sliding window, with RGB 
and optical flow data. The temporal classification provides insight into the tempo- 
ral consistency between the 8th and the 9th frame within the 16-frame interval. In 
order to handle multiple shots in a video with hard cuts, we extend the binary clas- 
sifier to three classes: “none”—no temporal inconsistency indicating manipulation; 
“frame drop’—there are frames removed within one-shot video; and “shot break” or 
“break” —there is a temporal boundary or transition between two video shots. Note 
that the training data with shot-break videos are obtained from TRECVID 2007 
dataset (Kawai et al. 2007), and we only use the hard-cut shot-breaks since soft-cut 
changes gradually and has strong consistency between any two consecutive frames. 
The confusion matrix in Fig. 13.11 illustrates the high effectiveness of the proposed 
13D network-based inconsistency detector. 

Based on the output scores for the three categories from the I3D network, i.e., 
SS G), cee (i) and S?5<* (i), we formulate the confidence score of inconsistency 
as the following function: 


SÒ = Sp ü) + SHH" D — ASSO (13.7) 


where à is the weight parameter, and for the results presented here, we use A = 0.1. 
We assume the selected original frames have a higher temporal consistency with 


13 Video Frame Deletion and Duplication 353 


none 
drop 
break 


none drop break 


Fig. 13.11 The confusion matrix for three classes of temporal inconsistency within a video, used 
with the I3D-based inconsistency. We expect a high likelihood of “drop” class at the two ends of the 
duplicated frame sequence and a high “none” likelihood at the ends of the selected original frame 
sequence 
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Fig. 13.12 Illustration of distinguishing duplicated frames from the selected frames. The index 
ranges for the red frame sequence and the green sequence are [72, 191] and [290, 409], respectively. 
sı and s2 are the corresponding inconsistency scores for the red sequence and green sequence, 
respectively. Obviously, sı > s2, which indicates that the red sequence is duplicated frames as 
expected 


frames before and after such frames than the duplicated frames because the insertion 
of duplicated frames usually causes a sharp inconsistency at the beginning and the 
end of the duplicated interval, as illustrated in Fig. 13.12. Given a pair of frame 
sequences that are potentially duplicated, [i, i + I] and [j, j + 1], we compare two 


scores, 
wind 


sp= >` S@—-—1+k)4+S@+1+k) (13.8) 


k=—wind 


and 
wind 


s= D> SG-1+kh4+SG+I1+h), (13.9) 


k=—wind 


where wind is the window size. We check the inconsistency at both the beginning 
and the end of the sequence. In this work, we set wind = 3 to avoid the failure 
cases where a few start or end frames were detected incorrectly. If s, > s2, then the 
duplicated frame segment is [7, i + /]. Otherwise, the duplicated frame segmentation 
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Fig. 13.13 Illustration of frame-to-frame distance between duplicated frames and the selected 
frames 


is [j, j +l]. As shown in Fig. 13.12, our modified I3D network is able to measure 
the consistency between consecutive frames. 


13.4.4 Experimental Results 


We evaluate our proposed C2F-DCNN method on a self-collected video dataset and 
the Media Forensics Challenge 2018 (MFC18)* dataset (Guan et al. 2019). 

Our self-collected video dataset is obtained through automatically adding frame 
duplication manipulation on the 12 raw static camera videos from VIRAT dataset (Oh 
et al. 2011) and 17 dynamic iPhone 4 videos. The duration of each video is in the 
range from 47seconds to 3minutes. In order to generate test videos with frame 
duplication, we randomly select frame sequences with the duration 0.5, 1, 2, 5 and 
10s, and then re-insert them into the same source videos. We use the X264 video 
codec and a frame rate of 30 fps to generate these manipulated videos. Note that we 
avoid any temporal overlap between the selected original frames and the duplicated 
frames in all generated videos. Since we have the frame-level ground truth, we can 
use it for frame-level performance evaluation. 

The MFC18 dataset consists of two subsets, Dev dataset and Eval dataset, which 
we denote as the MFC18-Dev dataset and the MFC18-Eval dataset, respectively. 
There are 231 videos in the MFC18-Dev dataset and 1036 videos in the MFC18- 
Eval dataset. The duration of each video is in the range from 2 seconds to 3 minutes. 
The frame rate for most of the videos is 29-30 fps, while a smaller number of videos 
are 10 or 60fps and only five videos in the MFC18-Eval dataset are larger than 
240 fps. We opt out these five videos and another two videos which have less than 17 


4 https://www.nist.gov/itl/iad/mig/media-forensics-challenge-2018. 
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frames from the MFC18-Eval dataset because the input for the 13D network should 
have at least 17 frames. We use the remaining 1029 videos in the MFC18-Eval dataset 
to conduct the video-level performance evaluation. 

The detection task is to detect whether or not a video has been manipulated with 
frame duplication manipulation, while the localization task to localize the duplicated 
frames index. For the measurement metrics, we use the performance measures of area 
under the ROC curve (AUC) for the detection task, and use the Matthews correlation 
coefficient 


TP x TN — FP x FN 


MCC = 
JOP + FP)(TP + FN)(TN + FP)(TN + FN) 


for localization evaluation, where TP, FP, TN and FN refer to frames which represent 
true positive, false positive, true negative and false negative, respectively. See Guan 
et al. (2019) for further details on the metrics. 


13.4.4.1  Frame-Level Analysis on Self-collected Dataset 


To better verify the effectiveness of deep learning solution in frame-duplication 
detection on the self-collected dataset, we consider four baselines: Lin et al.’s 
method (Guo-Shiang and Jie-Fan 2012) that uses histogram difference as the detec- 
tion features, Yang et al.’s method (Yang et al. 2016) that is an effective similarity- 
analysis-based method with SVD features, Ulutas et al.’s method (Ulutas et al. 2017) 
based on binary features and another method by them (Ulutas et al. 2018) that uses 
bag-of-words with 130-dimensional SIFT descriptors. Different from our proposed 
C2F-DCNN method, all of these methods use traditional feature extraction without 
deep learning. 

Note that the manipulated videos are generated by us, hence both selected original 
frames and duplicated frames are accessible to us. We treat these experiments as a 
white-box attack and evaluate the performance of frame-to-frame distance measure- 
ments. 

We run the proposed C2F-DCNN approach and the above-mentioned four state- 
of-the-art approaches on our self-collected dataset and the results are summarized 
in Table 13.3. As we can see, due to the X264 codec, the contents of the duplicated 
frames have been affected so that the detection of a duplicated frame and its corre- 
sponding selected frame is very challenging. In this case, our C2F-DCNN method 
still outperforms the four previous methods. 

To help the reader better understand the comparison, we provide a visualization 
of the normalized distances between the selected frames and the duplicated frames 
in Fig. 13.13. We can see our C2F-DCNN performs the best for both sample videos, 
especially with respect to the ability to distinguish the temporal boundary between 
duplicated frames and non-duplicated frames. All these observations strongly demon- 
strate the effectiveness of this deep learning approach for frame duplication detection. 
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Table 13.3 The AUC performance of frame-to-frame distance measurements for frame duplication 
detection on our self-collected video dataset.(unit: %) 


Method iPhone 4 videos VIRAT videos 
Lin 2012 (Guo-Shiang and 80.81 80.75 

Jie-Fan 2012) 

Yang 2016 (Yang et al. 2016) | 73.79 82.13 

Ulutas 2017 (Ulutas et al. 70.46 81.32 

2017) 

Ulutas 2018 (Ulutas et al. 73.25 69.10 

2018) 

C2F-DCNN 81.46 84.05 


Fig. 13.14 The ROC curves 
for video-level frame 
duplication detection on the 
MFC18-Dev dataset 


Fig. 13.15 The ROC curves 
for video-level frame 
duplication detection on the 
MFC 18-Eval dataset 
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13.4.4.2 Video-Level Analysis on the MFC18 Dataset 
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It is worth mentioning that the duplicated videos in the MFC18 dataset usually 
include multiple manipulations, and this makes the content between the selected 
original frames and duplicated frames different at times. Therefore, the testing video 
in both the MFC18-Dev and the MFC18-Eval datasets are very challenging. Since we 
are not aware of the details about the generation of all the testing videos, we take this 
dataset as a black-box attack and evaluate its video-level detection and localization 


performance. 
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Table 13.4 The MCC metric in [—1.0, 1.0] range for video temporal localization on the MFC18 
dataset. Our approach generates the best MCC score, where 1.0 is perfect 


Method MFC18-Dev MFC 18-Eval 
Lin 2012 (Guo-Shiang and 0.2277 0.1681 
Jie-Fan 2012) 

Yang 2016 (Yang et al. 2016) | 0.1449 0.1548 
Ulutas 2017 (Ulutas et al. 0.2810 0.3147 
2017) 

Ulutas 2018 (Ulutas et al. 0.0115 0.0391 
2018) 

C2F-DCNN w/ ResNet 0.4618 0.3234 
C2F-DCNN w/ C3D 0.6028 0.3488 
C2F-DCNN w/ I3D 0.6612 0.3606 


Table 13.5 The video temporal localization performance on the MFC18 dataset. Note ,/, x and 
@ indicate correct cases, incorrect cases and ambiguously incorrect cases, respectively. And #(.) 
indicates the number of a kind of specific cases 


Dataset #(/) #(x) #(@) 
MFC18-Dev 14 6 1 
MFC18-Eval 33 38 15 


We compare the proposed C2F-DCNN method and the above-mentioned four 
state-of-the-art methods, i.e., Lin 2012 (Guo-Shiang and Jie-Fan 2012), Yang 
2016 (Yang et al. 2016), Ulutas 2017 (Ulutas et al. 2017) and Ulutas 2018 (Ulutas et al. 
2018) on these two datasets. We use the negative minimum distance (i.e., —dmin) as a 
default video-level scoring method to generate a video-level score for each compet- 
ing method, including ours. “C2F-DCNN+confscore” denotes our best configuration 
with C2F-DCNN along with the proposed video-level confidence score defined in 
Eq. 13.6. In contrast, “C2F-DCNNa” uses only —dmin as the confidence score. The 
comparative manipulated video detection results are summarized in Figs. 13.14 and 
13.15. 

A few observations that we would like to point out: (1) C2F-DCNN always out- 
performs the four previous methods for the video-level frame duplication, with the 
video-level score as negative minimum distance; (2) with “+conf score”, our “C2F- 
DCNN+confscore” method generates a significant boost in AUC as compared to 
the baseline score of —dmin and achieves a high correct detection rate at a low false 
alarm rate; and (3) the proposed “C2F-DCNN+confscore” method achieves very high 
AUC scores on the two benchmark datasets: 99.66% on MFC18-Dev, and 98.02% 
on MFC18-Eval. 

We also performed a quantified analysis of the temporal localization within a 
manipulated video with frame duplication. For comparison with the four previ- 
ous methods, we use the feature distance between any two consecutive frames. 
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Fig.13.16 The visualization of confusion bars in video temporal localization. For each subfigure, 
the top (purple) bar is ground truth indicating duplication, the middle bar (pink) is the system output 
from the proposed method and the bottom bar is the confusion calculated based on the above the 
truth and the system output. Note TN, FN, FP, TP and “OptOut” in the confusion are marked in 
white, blue, red, green and yellow/black, respectively. a and b—d are correct results, which include 
completely correct cases and partially correct cases. e and f show the failure cases 


For the proposed C2F-DCNN approach, the best configuration “C2F-DCNN w/ 
I3D” includes the I3D network as the inconsistency detector. We also provide two 
baseline variants by replacing the [3D inconsistency detector with a ResNet net- 
work feature distance Saj, (i) only (““C2F-DCNN w/ ResNet’) or the C3D network’s 
scores sap (i) — AS¢35 (i) from (Chengjiang et al. 2017) (“C2F-DCNN w/ C3D”). 
The temporal localization results are summarized in Table 13.4, from which we can 
observe that (1) our deep learning solutions, “C2F-DCNN w/ ResNet’, “C2F-DCNN 
w/ C3D” or “C2F-DCNN w/ I3D” work better than the four previous methods and 
“C2F-DCNN w/ I3D” performs the best. These observations suggest that 3D convo- 
lutional kernel is able to measure the inconsistency between the consecutive frames, 
and both RGB data stream and optical flow data stream are complementary to further 
improve the performance. 

To better understand the video temporal localization measurement, we plot the 
confusion bars on the video timeline based on the truth and the corresponding system 
output under different scenarios, as shown in Fig. 13.16. We would like to emphasize 
that no algorithm is able to distinguish duplicated frames from selected frames for 
the ambiguously incorrect cases indicated as @ in Table 13.5, because such videos 
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often break the assumption of temporal consistency and in many cases the duplicated 
frames are difficult to identify by the naked eye. 


13.5 Conclusions and Discussion 


We presented a C3D-based network with a confidence score defined with a peak 
detection step and a scale term for frame dropping detection. The method we pro- 
posed in Chengjiang et al. (2017) flexibly explores the underlying spatio-temporal 
relationship across the one-shot videos. Empirically, it is not only able to identify 
manipulation of temporal removal type robustly but also to detect the exact location 
where the frame dropping occurred. 

Our future work includes revising frame dropping strategy to be more realistic for 
training video collection, evaluating an LSTM-based network for quicker runtime, 
and working on other types of video manipulation detection such as addressing shot 
boundaries and duplication in looping cases. 

Multiple factors cause frame duplication detection and localization becoming 
more and more challenging in video forgeries. These factors include high frame rates, 
multiple manipulations (e.g. “SelectCutFrames”, “TimeAlterationWarp”, 
“AntiForensicCopyExif”, “RemoveCamFingerprintPRNU”) involved before and 
after, and gaps between the selected frames and the duplicated frames. In partic- 
ular, zero gap between the selected frames and the duplicated frames renders the 
manipulation undetectable because the inconsistency which should exist at the end 
of the duplicated frames does not appear in the video temporal context. 

Regarding the runtime, the I3D network for inconsistency detection is the most 
expensive component in our framework but we only apply it on the candidate frames 
that are likely to have frame duplication manipulations detected in the coarse-search 
stage. For each testing video clip with a 16-frame length, it takes about 2s with our 
learned I3D network. For a one-minute short video with 30 FPS, it requires less than 
5 min to complete the testing throughout all the frame sequences. 

The coarse-to-fine deep learning approach is designed for frame duplication detec- 
tion at both frame-level and video-level, as well as for video temporal localization. 
This work also included a heuristic strategy to formulate the video-level confidence 
score, as well as an I3D network-based inconsistency detector to distinguish the 
duplicated frames from the selected frames. The experimental results have demon- 
strated the robustness and effectiveness of the method. 

Our future work includes continuing to extend multi-stream 3D neural networks 
for both frame drop, frame duplication and other video manipulation tasks like loop- 
ing detection, working on frame-rate variations and train on multiple manipulations, 
investigating the effects of various video codecs on algorithm accuracy. 


5 These manipulation operation are defined in the MFC18 dataset. 
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Chapter 14 R) 
Integrity Verification Through File herst 
Container Analysis 


Alessandro Piva and Massimo Iuliani 


In the previous chapters, multimedia forensics techniques based on the analysis 
of the data stream, i.e., the audio-visual signal, aimed at detecting artifacts and 
inconsistencies in the (statistics of the) content were presented. Recent research 
highlighted that useful forensic traces are also left in the file structure, thus offering 
the opportunity to understand a file’s life-cycle without looking at the content itself. 
This Chapter is then devoted to the description of the main forensic methods for the 
analysis of image and video file formats. 


14.1 Introduction 


Most forensic techniques look for traces within the content itself that are, however, 
mostly ineffective in some scenarios, for example, when dealing with strongly com- 
pressed or low resolution images and videos. Recently, it has been shown that also the 
image or video file container maintain some traces of the content history. The analysis 
of the file format and metadata allows to determine their compatibility, complete- 
ness, and consistency based on the expected media’s history. This analysis proved to 
be strongly promising since the image and video compression standards leave some 
freedom in the file container’s generation. This fact allows forensic practitioners to 
identify media’s container signatures related to a specific brand, model, social media 
platform, etc. Furthermore, the cost for analyzing a file header is usually negligible. 
This is even more relevant when dealing with digital videos where the stream analy- 
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sis can quickly become unfeasible. Indeed, recent methods highlight that video can 
be analyzed in seconds, independently of their length and resolution. 

It is worth also mentioning that most of the achieved results highlighted that most 
interesting containers’ signatures usually have a low intra-variability, thus suggesting 
that this analysis can be promising even when only a few training data are available. 
On the other side, these techniques’ main drawback is that they only allow forgery 
detection while localization is usually beyond its capabilities. 

Furthermore, the design of practical media containers’ parsers is not to be under- 
estimated since it requires a deep comprehension of image and video formats, and the 
parser should be updated when new models are introduced in the market or unknown 
elements are added to the media header. Finally, we are currently not aware of any 
publicly available software that would allow users to consistently forge such infor- 
mation without advanced programming skills. This makes the creation of plausible 
forgeries undoubtedly a highly non-trivial task and thereby reemphasizes that file 
characteristics and metadata must not be dismissed as unreliable source of evidence 
for the purpose of file authentication per se. The Chapter is organised as follows: 
in Sects. 14.1.1 and 14.1.2 the main image and video file format specifications are 
summarized. Then, in Sect. 14.2 and 14.3 the main methods for the analysis of image 
and video file formats are described. Finally, in Sect. 14.4 the main findings of this 
research area are provided, along with some possible future works. 


14.1.1 Main Image File Format Specifications 


Multimedia technologies allow digital image generation based on several image for- 
mats. DSLR cameras and proficient acquisition devices can store images in uncom- 
pressed formats (e.g., DNG, CR2, BMP). When necessary, lossless compression 
schemes (e.g., PNG) can be applied to reduce the image size without impacting its 
quality. Furthermore, lossy compression schemes (e.g., JPEG, HEIF) are available 
to strongly reduce image size with minimal impact on visual quality. 

Nowadays, most of the devices on the market and social media platforms generate 
and store JPEG images. For this reason, most of the forensic literature focuses on this 
image type. The JPEG standard itself JPEG (1992) defines the pixel data encoding 
and decoding, while the full-featured file format implementation is defined in the 
JPEG Interchange Format (JIF) ISO/IEC (1992). This format is quite complicated, 
and some simpler alternatives are generally preferred for handling and encapsulating 
JPEG-encoded images: the JPEG File Interchange Format (JFIF) Eric Hamilton 
(2004) and the Exchangeable Image File Format (Exif) JEITA (2002), both built on 
JIF. 

Each of the formats defines the basic structure of JPEG files as a set of marker 
segments, either mandatory or optional. An identifier of the marker id indicates 
the beginning of each marker segment. Abbreviations for marker ids are denoted 
by 16 bit short values starting with OxFF. Marker segments can encapsulate either 
data compression parameters (e.g., in DQT, DHT, or SOF) or application-specific 
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metadata, like thumbnails or EXIF metadata (e.g., in APPO(JFIF) or APP1(EXIF)). 
Additionally, some markers consist of the marker id only and indicate state transitions 
necessary to parse the file format. For example, SOI and EOI indicate the start and end 
of a JPEG file, requiring all other markers to be placed between these two mandatory 
markers. 

The JIF standard also allows application markers (APPn), populated with entries 
in the form of key-value pairs. The values can vary from human-readable strings to 
complex structures like binary data. The APPO segment defines pixel densities and 
pixel ratios, and an optional thumbnail of the actual image can be placed here. In 
APP1, the EXIF standard enables cameras to save a vast range of optional information 
(EXIF-Tags) related to the camera’s photographic settings when and where the image 
was taken. The information are split into five main image file directories (IFDs): (i) 
Primary; (ii) Exif; (iii) Interoperability; (iv) Thumbnail; and (v) GPS. However, the 
content of each IFD is customizable by camera manufacturers, and the EXIF standard 
also allows for the creation of additional IFDs. Other metadata are customizable 
within the file header, such as XMP and IPTC that provide additional information 
like copyright or the image’s editing history. Furthermore, image processing software 
can introduce a wide variety of marker segment sequences. 

These differences in the file formats and the not strictly standardization of several 
sequences may expose forensic traces within the image file container Thomas Gloe 
(2012). Indeed, not all segments are mandatory and different combinations thereof 
can occur. Some segments can appear multiple times (e.g., quantization tables can 
be either encapsulated in one single or multiple separate DQT segments). Further- 
more, the sequence of segments is generally not fixed (with some exceptions) and 
customizable. For example, JPEG thumbnails exist in different segments and employ 
their own complete sequence of marker segments. Eventually, arbitrary data can exist 
after EOI. 

In the next sections, we report the main technologies and the results achieved 
based on the technical analysis of these formats. 


14.1.2 Main Video File Format Specifications 


When a camera acquires a digital video, image sequence processing and audio 
sequence processing are performed in parallel. After compression and synchro- 
nization, the streams are encapsulated in a multimedia container, simply called a 
video container from now on. There are several compression standards; however, 
H264/AVC or MPEG-4 Part 10 International Telecommunication Union (2016), and 
H265/HEVC or MPEG-H Part 2 International Telecommunication Union (2018) are 
the most relevant since most mobile devices implement them. Video and audio tracks, 
plus additional information (such as metadata and sync information), are then encap- 
sulated in the video container based on specific format standards. In the following, 
we describe the two most adopted formats. 
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MP4-like videos 


Most smartphones and compact cameras output videos in mp4, mov, or 3gp format. 
This video packaging refers to the same standard, ISO/IEC 14496 Part 12 ISO/IEC 
(2008), that defines the main features of MP4 File Format ISO/IEC (2003) and MOV 
File Format Apple Computer, Inc (2001). The ISO Base format is characterized by 
a sequence of atoms or boxes. A unique 4-byte code identifies each node (atom) 
and each atom consists of a header and a data box, and possibly nested atoms. As 
an example, we report in Fig. 14.1 a typical structure of an MP4-like container. The 
file type box, ft yp, is a four-letter code that is used to identify the type of encod- 
ing, the compatibility, or the intended usage of a media file. According to the latest 
ISO standards, it is considered a semi-mandatory atom, i.e., the ISO expects it to be 
present and explicit as soon as possible in the file container. In the example given in 
Fig. 14.1 (a), the fields of the f typ descriptor explain that the video file is MP4-like 
and it is compliant to the MP4 Base Media v1 [ISO 14496-12:2003] (here isom) 
and the 3GPP Media (.3GP) Release 4 (here 3gp4) specifications.! The movie box, 
moov, is a nested atom containing the metadata needed to decode the data stream, 
which is embedded in the following mdat atom. It is important to note that moov 
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may contain multiple trak box instances, as shown in Fig. 14.1. The trak atom is 
mandatory, and its numerosity depends on the number of streams included in the file; 
for example, if the video contains a visual-stream and an audio-stream, there will be 
two independent trak atoms. A more detailed description of these structures can 
be found in Carlos Quinto Huamán et al. (2020). 


AVI videos 


Audio video interleave (AVI) is a container format developed by Microsoft in 1992 for 
Windows software Microsoft. Avi riff file (1992). AVI files are based on the RIFF 
(resource interchange file format) document format consisting of a RIFF header 
followed by zero or more lists and chunks. 

A chunk is defined as ckID, ckSize, and ckData, where ckID identifies the data 
contained in the chunk, ckSize is a 4-byte value giving the size of the data in ckData, 
and ckData is zero or more bytes of data. 

A list consists of LIST, listSize, listType, and listData, where LIST is the literal 
code LIST, listSize is a 4-byte value giving the size of the list, listType is a 32-bit 
unsigned integer, and /istData consists of chunks or lists, in any order. 

An AVI file might also include an index chunk, which gives the location of the 
data chunks within the file. All AVI files must keep these three components in the 
proper sequence: the hdr list defining the format of the data and is the first required 
LIST chunk; the movi list containing the data for the AVI sequence; the idx/ list 
containing the index. 

Depending on the specific camera or phone model, additional lists and JUNK 
chunks may exist between the hdrl and movi lists. These optional data segments are 
either used for padding or to store metadata. The idx] chunk indexes the data chunks 
and their location in the file. 


14.2 Analysis of Image File Formats 


The first studies in the image domain considered JPEG quantization tables and image 
resolution values Hany Farid (2006); Jesse D. Kornblum (2008); H. Farid (2008). 
Just with these few features, the authors demonstrated that it is possible to link 
a probe image to a set of devices or editing software. Then, in Kee Eric (2011), 
the authors increased the set of features by also considering JPEG coding data and 
Exif metadata. These studies provided better results in terms of integrity verification 
and device identification. The main drawback of such approaches is that a user can 
easily edit the metadata’s information, limiting these methods’ reliability. Following 
studies demonstrated that the file structure also contains a lot of traces of the content’s 
history. These data are also more reliable, being much more challenging to extract 
and modify for a user than metadata. Furthermore, available editing software and 
metadata editors do not allow the modification of such low-level information Thomas 
Gloe (2012). However, like the previous ones, this method is based on a manual 
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analysis of the extracted features, to be compared with a set of collected signatures. 
In Mullan Patrick et al. (2019), the authors present the first automatic method for 
characterizing the source device, based on the analysis of features extracted from 
the JPEG header information. Finally, another branch of research demonstrated that 
the JPEG file format can be exploited to understand whether a probe image has 
been exchanged through a social network service. In the following, we describe the 
primary studies of this domain. 


14.2.1 Analysis of JPEG Tables and Image Resolution 


In Hany Farid (2006), the author shows, for the first time, that manufacturers typically 
configure the compression of devices differently according to their own needs and 
preferences. This difference, primarily manifested in the JPEG quantization table, 
can be used to identify the device that acquired a probe image. 

To carry out this analysis, the authors collect a single image, at the highest quality, 
from each of 204 digital cameras. Then, they extract the JPEG quantization table, 
noticing that 62 cameras show a unique table. 

The remaining cameras fall into equivalence classes of sizes between 2 and 28 
(i.e., between 2 and 28 devices share the same quantization tables). Usually, cameras 
that share the same tables belong to the same manufacturer. Conversely, different 
makes and models usually share the same table. These results highlight that JPEG 
quantization tables can partially characterize the source device. Thus, they effectively 
narrow the source of an image, if not to a single camera make and model, at least to 
a small set of possible cameras (on average, this set size is 1.43). 

This study was performed by also saving the images with five Photoshop versions 
(at that time, version CS2, CS, 7, 4, and 3) at each of the 13 available levels of 
compression available. As expected, quantization tables at each compression level 
were different from one another. Moreover, at each compression level, the tables were 
the same for all five versions of Photoshop. More importantly, no one of these tables 
can be found in the images belonging to the 204 digital cameras. These findings 
allow arguing the possibility to link an image to specific editing software, or at least 
to a set of possible editing tools. 

Similar results have been presented in Jesse D. Kornblum (2008). The authors 
examined images from devices (a Motorola KRZR K1m, a Canon PowerShot 540, a 
FujiFilm Finepix A200, a Konica Minolta Dimage Xg, and a Nikon Coolpix 7900) 
and images edited by several software programs such as libjpeg, Microsoft Paint, 
the Gimp, Adobe Photoshop, and Irfanview. The analysis carried out on this dataset 
shows that, although some cameras always use the same quantization tables, most of 
them use a different set of quantization tables in each image. Moreover, these tables 
usually differ according to the source or editing software. 

In H. Farid (2008), the author expands the earlier work by considering a much 
larger dataset of images and adding the image resolution to the quantization table as 
discriminating features for source identification. 


14 Integrity Verification Through File Container Analysis 369 


The author analyzes over 330 thousand Flickr images from 859 different camera 
models from 48 manufacturers. Quantization tables and resolutions allowed to obtain 
10153 different image classes. 

The resolution and quantization table were then combined to narrow the search 
criteria to identify the source camera. The analysis revealed that 2704 out of 10153 
entries (26.6%) have a unique paired resolution and quantization table. In these cases, 
the features can correctly discriminate the source device. Moreover, 37.2% of the 
classes have at most two matches, and 44.1% have at most three matches. These 
results show that the combination of JPEG quantization table and image resolution 
allows matching the source of an image to a single camera make and model or at 
least to a small set of possible devices. 


14.22 Analysis of Exif Metadata Parameters 


In Kee Eric (2011), the authors expand the previous analysis by creating a larger image 
signature for identifying the source camera with greater accuracy than only using 
quantization tables and resolution. This approach considers a set of features of the 
JPEG format, namely properties of the run-length encoding employed by the JPEG 
standard and some Exif header format characteristics. In particular, the signature 
is composed of three kinds of features: image parameters, thumbnail parameters, 
and Exif metadata parameters. The image parameters consist of the image size, 
quantization tables, and the Huffman codes; the Y, Cb, and Cr 8 x 8 quantization 
tables are represented as a vector of 192 values; the Huffman codes are specified 
as six vectors (one for the dc DCT coefficients and one for the ac DCT coefficients 
of each of the three channels Y, Cb, and Cr) storing the number of codes of length 
from | to 15. This part of the signature is then composed of 284 values: 2 image 
dimensions, 192 quantization values, and 90 Huffman codes. 

The same components are extracted from the image thumbnail, usually embedded 
in the JPEG header. Thus, the thumbnail parameters are other 284 values: 2 thumbnail 
dimensions, 192 quantization values, and 90 Huffman codes. 

A compact representation of EXIF Metadata is finally obtained by counting the 
number of entries in each of the five image file directories (IFDs) that compose the 
EXIF metadata, as well as the total number of any additional IFDs, and the total 
number of entries in each of these. It worth noticing that some manufacturers adopt 
metadata representation that does not conform to the EXIF standard, yielding errors 
when parsing the metadata. The authors also consider these errors as discriminating 
features, so the total number of parser errors, as specified by the EXIF standard, is 
also stored. Thus, they consider 8 values from the metadata: 5 entry values from the 
standard IFDs, 1 for the number of additional IFDs, 1 for the number of entries in 
these additional IFDs, 1 for the number of parser errors. 

The image signature is then composed of 284 header features extracted from the 
full resolution image, 284 header features from the thumbnail image, and another 8 
from the EXIF metadata, for a total of 576 features. These signatures are extracted 
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Table 14.1 The percentage of camera configurations with an equivalence class size from | to 5. 
Each row corresponds to different subsets of the complete signature 


Equivalence class size 

1 2 3 4 5 
Image 12.9% 7.9% 6.2% 6.6% 3.4% 
Thumbnail 1.1% 1.1% 1.0% 1.1% 0.7% 
EXIF 8.8% 5.4% 4.2% 3.2% 2.6% 
Image+thumbnail =| 24.9% 15.3% 11.3% 7.9% 3.7% 
All 69.1% 12.8% 5.7% 4.0% 2.9% 


from the image files and analyzed to verify if they can assign the image to a given 
camera make and model. 

Tests are carried out on over 1.3 million Flickr images, acquired by 773 camera 
and cell phone models, belonging to 33 different manufacturers. 

These images span 9, 163 different distinct pairings of the camera make, model, 
and signature, referred to as a camera configuration. It is worth noting that each 
camera model can produce different signatures based on the acquisition resolutions 
and quality settings. Indeed, in the collected dataset, we find an average of 12 different 
signatures for each camera make and model. 

The camera signature analysis on the above collection shows that the image, 
thumbnail, and EXIF parameters are not distinctive individually. However, when 
combined, they provide a highly discriminative signature. This result indicates that 
the three classes of parameters are not highly correlated, and hence their combination 
improves overall distinctiveness, as show in Table 14.1. 

Eventually, the signature was tested on images edited by Adobe Photoshop (ver- 
sions 3, 4, 7, CS, CS2, CS3, CS4, CS5 at all qualities), by exploiting, in this case, 
image and thumbnail quantization tables and Huffman codes only. No overlaps were 
found between any Photoshop version/quality and camera manufacturer. The Photo- 
shop signatures, each residing in an equivalence class of size 1, are unique. This means 
that any photo-editing with Photoshop can be easily and unambiguously detected. 


14.2.3 Analysis of the JPEG File Format 


In Thomas Gloe (2012), the author makes a step forward with respect to the previous 
studies. They analyze the JPEG file format structure to identify new characteristics 
that are more discriminating than JPEG metadata only. In this study, a subset of 
images extracted from the Dresden Dataset (Thomas Gloe and Rainer Bahme 2014) 
was used. In particular, 4, 666 images belonging to the JPEG scenes (a part of the 
dataset built to specifically study model-specific JPEG compression algorithms) and 
16, 956 images of natural scenes were considered. Images were captured with one 


14 Integrity Verification Through File Container Analysis 371 


device of each available camera model while iterating over all combinations of com- 
pression, image size, and flash settings) Overall, they considered images acquired by 
73 devices from 26 camera models of 14 different brands. A part of these authentic 
images was re-saved with each of the eight investigated image processing software 
(ExifTool, Gimp, IrfanView, Jhead, cjpeg, Paint.NET, PaintShop, and Photoshop), 
with all available combinations of JPEG compression settings. Finally, they obtained 
and analyzed more than 32, 000 images. 

The study focuses firstly on analyzing the sequence and occurrence of marker 
segments in the JPEG data structures. The main findings of the author are that: 


1. some optional segments can occur with different combinations; 

2. segments can appear multiple times; 

3. the sequence of segments is generally not fixed, with the exception of some 
required combinations; 

4. JPEG thumbnails exist in different segments and employ their complete sequence 
of marker segments; 

5. arbitrary data can exist after the marker end of image (EOI) of the main image. 


The markers’ analysis allows to distinguish between pristine and edited images cor- 
rectly. Furthermore, none of the investigated software allows to recreate the sequence 
of markers consistently. 

The author also considered the sequence of EXIF data structures that store cam- 
era properties and image acquisition settings. Interestingly, they found differences 
between digital cameras, image processing software, and metadata editors. In par- 
ticular, the following four characteristics appear to be relevant: 


1. the byte order is different between camera models (and the default setting 

employed by ExifTool); 

2. sequences of image file directories (IFDs) and corresponding entries (including 
tag, data type and often also offsets) appear constant in images acquired with the 
same model and differ between different sources; 

. some manufacturers use different data types for the same entry; 

4. raw values of rational types differ between different models while resulting in the 

same interpreted values (e.g., 200/10 or 20/1). 


W 


These findings allow to conclude that the analysis of EXIF data structures can be 
useful to discriminate between authentic and edited images. 


14.2.44 Automatic Analysis of JPEG Header Information 


Allprevious works analyzed images acquired by digital cameras, whereas today, most 
of the visual content is generated by smartphones. These two classes of devices are 
somewhat different. While traditional digital cameras typically have a fixed firmware, 
modern smartphones keep updating their acquisition software and operating system, 
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making it harder to achieve a unique characterization of device models based on the 
image file format. 

In Mullan Patrick et al. (2019), the authors performed a large-scale study on Apple 
smartphones to characterize the source device based on JPEG header information. 
The results confirm that identifying the hardware is much harder for smartphones 
than for traditional cameras, while we can identify the operating system version and 
selected apps. 

The authors exploit Flickr images following the rules adopted in Kee Eric (2011). 
Starting from a crawling of one million images, they filtered the downloaded media 
to remove edited images, obtaining at the end a dataset of 432, 305 images. All 
images belong to Apple devices, including all models comprised between iPhone 4 
and iPhone X, between iPad Air and iPad Air 2, with operating system versions from 
iOS 5 to iOS 12. 

They consider two families of features: the number of entries in the image 
file directories (IFDs) and the quantization tables. The selected directories include 
the standard IFDs “ExifIFD”, “IFDO”, “IFD1”, “GPS”, and the special directories 
“ICC_Profile” and “MakerNotes”. The Y and Cb quantization tables are concate- 
nated, and their hash is computed, obtaining one alphanumeric number, called QC- 
Table. Differently from Kee Eric (2011), the Huffman tables were not considered 
since, in the collected database, they are identical in the vast majority of cases. For 
similar reasons, they also discarded the sequence of JPEG syntax elements studied 
by Thomas Gloe (2012). 

First, the authors verified the modification of the header information for Apple 
smartphones over time. They showed that image metadata change more often than 
compression matrices, and that different hardware platforms with same operating 
system exhibit very similar metadata information. These findings highlight that it is 
more complicated to analyse smartphones than digital cameras. 

After that, the authors design an algorithm for determining the smartphone hard- 
ware model from the header information in an automatic way. This is the first algo- 
rithm designed to perform an automatic source identification on images through file 
format information. The classification is performed with a random forest, that uses 
numbers of entries per directory and quantization tables, as described before, as 
features. 

Firstly, a random forest classifier was trained to identify seven different iPhone 
models marketed in 2017. Secondly, the same algorithm was trained to verify its 
ability to identify the operating system from header information, considering all 
versions from iOS 4 to iOS 11. In all the two cases, experiments were performed 
by considering as input features: (i) the Exif directories only, (ii) the quantization 
tables only, (iii) metadata and quantization features together. We report the achieved 
results in Table 14.2. It worth noticing that using only the quantization tables gives 
the lowest results, whereas the highest accuracy is reached with all the considered 
features. Moreover, it is easier to discriminate among operating systems than among 
hardware models. 
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Table 14.2 Overall accuracy of models and iOS version classification when using EXIF directories 
(EXIF-D), quantization matrices (QT), or both 


EXIF-D QT Both 


iPhone Models 61.9% 35.4% 65.6% 
classification 


1OS version 
Classification 


14.2.5 Methods for the Identification of Social Networks 


Several works investigated how JPEG file format modifications can be exploited to 
understand whether a probe image has been exchanged through a social network 
service. Indeed, a social network usually recompresses, resizes, and alters image 
metadata. 

Castiglione et al. Castiglione et al. (2011) made the first study on image resizing, 
compression, renaming, and metadata modifications introduced by Facebook, Badoo, 
and Google+. 

Giudice et al. Oliver Giudice et al. (2017) built a dataset of images from different 
devices, including digital cameras and smartphones with Android and iOS operating 
systems. Then, they exchanged all the images through ten different social networks. 

The considered features are a 44-dimensional vector composed of pixel width and 
height, an array containing the Exif metadata, the number of JPEG markers, the first 
32 coefficients of the luminance quantization table, and the first 8 coefficients of the 
chrominance quantization table. These features are fed to an automatic detector based 
on two K-NN classifiers and a decision tree fitted on the built dataset. The method 
can predict the social network the image belongs to and the client application with 
an accuracy of 96% and 97.69% respectively. 

Summarizing, the authors noticed that modifications left by the process in the 
JPEG file depend on both the platform (server-side) and the application used for the 
upload (client-side). 

In Phan et al. (2019), the previous method was extended by merging JPEG meta- 
data information (including quantization tables, Huffman coding tables, image size) 
with content-related features, namely the histograms of DCT coefficients. A CNN is 
trained on these features to detect an image’s multiple exchanges through Facebook, 
Flickr, and Twitter. 

The adopted dataset consists of two parts: RRSMUD (RAISE Social Multiple Up- 
Download), generated by the RAISE dataset Duc-Tien Dang-Nguyen et al. (2015), 
containing images shared over the three Social Networks; V-SMUD (VISION Social 
Multiple Up-Download), obtained from a part of the VISION dataset Dasara Shullani 
et al. (2017), shared three times via the three social networks, following different 
testing configurations. 

The experiments demonstrated that JPEG metadata’s information improves the 
performance with respect to using content-based information only. When considering 
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single-shared contents, the accuracy improves from 85.63% to 99.87% on R-SMUD 
and is in both cases 100.00% on V-SMUD. When considering images shared both 
once and twice, the accuracy improves from 43.24% to 65.91% on R-SMUD and 
from 58.82% to 77.12% on V-SMUD. When dealing with three times shared images 
also, they achieve an accuracy of 36.18% on R-SMUD and 49.72% on V-SMUD. 


14.3 Analysis of Video File Formats 


Gloe et al. proposed the first more in-depth analysis of video file containers Gloe 
Thomas et al. (2014). The authors observe that videos from different cameras and 
phone models expose different container formats and compression. This allows a 
forensic practitioner to extract this information manually and, based on a comparison 
with a reference, expose forensic findings. In Jieun Song et al. (2016) the authors 
studied 296 video files in AVI format acquired with 43 video event data recorders 
(VEDRs), and their manipulated version with 5 different video editing software tools. 
All videos were classified according to a sequence of field data structure types. This 
study showed that the field data structure represents a valid feature set to determine 
whether video editing software was used to process a probe video. However, these 
methods require the manual analysis of the video container information, thus needing 
high effort and high programming skills. Some attempts were proposed to identify 
the containers’ signature of the most relevant brands and social media platforms 
Carlos Quinto Huamán et al. (2020). Another path tried to automatize the forensic 
analysis of video file structures by verifying their multimedia stream descriptors 
with simple binary classifiers through a supervised approach David Giiera et al. 
(2019). However, in Iuliani Massimo (2019), the authors propose a way to formalize 
the MP4-like (mp4, .mov, .3gp) video file structure. A probe video is converted 
in an XML tree-based structure; then, the comparison between the parsed data and 
a reference dataset addresses both integrity verification and brand classification. A 
similar approach has been also proposed in Erik Gelbing (2021), but the experiments 
have been applied only to the scenario of brand classification. 

Both Iuliani Massimo (2019) and David Giiera et al. (2019) allow the integrity 
assessment of a digital video with negligible computational costs. A further improve- 
ment was finally proposed in Yang Pengpeng et al. (2020), where decision trees were 
employed to reduce further the computational effort required to a fixed amount of 
time, independently of the reference dataset’s size. These latest works also introduced 
an open-world scenario where the tested device was not available in the training set. 
The issue is also addressed in Raquel Ramos Lopez et al. (2020), where information 
extracted from the video container was used to cluster sets of videos by data source, 
consisting in brand, model, or social media. In the next sections, we briefly summa- 
rize the main works: (i) the first relevant paper on video container Gloe Thomas et al. 
(2014), (ii) the first video container formalization and usage Iuliani Massimo (2019), 
(iii) the most recent work for efficient video file format analysis Yang Pengpeng et al. 
(2020). 
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14.3.1 Analysis of the Video File Structure 


In Gloe Thomas et al. (2014), the authors identify the manufacturer and model- 
specific video file format characteristics and show how processing software further 
modifies those features. In particular, they consider 19 digital cameras and 14 mobile 
phones belonging to different models, and they consider from 3 to 14 videos per 
device, with all available video quality settings (e.g., frame size and frame rate). 
In Table 14.3, we report the list of these devices. Most of these digital cameras 


Table 14.3 The list of devices analyzed in the study 


Make Model Container 
Digital camera models 
Agfa DC-504, DC-733s, DC-830i, | AVI 
Sensor530s 

Agfa SensorS05-X AVI 
Canon $45,S70, A640, Ixus Is AVI 
Canon EOS-7D MOV 
Casio EX-M2 AVI 
Kodak M1063 MOV 
Minolta DiMAGE Z1 MOV 
Nikon CoolPix $3300 AVI 
Pentax Optio A40 AVI 
Pentax Optio W60 AVI 
Praktica DC2070 MOV 
Ricoh GX100 AVI 
Samsung NV15 AVI 
Mobile phone models 

Apple IPhone 4 MOV 
Benq Siemens S88 3GP 
BlackBerry 8310 3GP 
Google Nexus 7 3GP 

LG KU990 3GP, AVI 
Motorola MileStone 3GP 
Nokia 6710 3GP, MP4 
Nokia E6li 3GP, MP4 
Nokia E65 3GP, MP4 
Nokia X3-00 3GP 
Palm Pre MP4 
Samsung GT-5500i MP4 
Samsung SGH-D600 MP4 
Sony Ericsson K800i 3GP 


376 A. Piva and M. Iuliani 


store videos in AVI format, and some use Apple Quicktime MOV containers and 
compress video data using motion JPEG (MJPEG), and only three use more efficient 
codecs (DivX, Xvid, or H.264). Allthe mobile phones store video data in MOV-based 
container formats (MOV, 3GP, MP4), except for the LG KU990 camera phone, which 
supports AVI. On the contrary, the mobile phones adopt H.263, H.264, MPEG-4 or 
DivX, and not MJPEG compression. 

A subset of the camera models has also been used to generate cut short videos, 
10 seconds long, through video editing tools (FFmpeg, Avidemux, FreeVideoDub, 
VirtualDub, Yamb, and Adobe Premiere. Except for Adobe Premiere, all of them 
have allowed lossless video editing, i.e., the edited files have been saved without re- 
compressing the original stream. After that, they have analyzed the extracted data. 
This analysis reveals considerable differences between device models and software 
vendors, and some general findings are summarized here. 

The authors carry out the first analysis by grouping the devices according to the file 
structure, i.e., AVI or MP4. Original AVI videos do not share the same components. 
Both the content and the format of specific chunks may vary, depending on the 
particular Camera model. Their edited versions, with all software tools, even when 
lossless processing is applied, leave distinct traces in the file structure, which do 
not match any of the camera’s features in the dataset. It is then possible to use 
these differences to spot if a video file was edited with these tools by comparing a 
questioned file’s file structure with a device’s reference container. 

Original MP4 videos, due to the by far more complex file format, exhibit an even 
more considerable degree of source-dependent internal variations than AVI files. 
However, none of the cut videos produced through the editing tools is compati- 
ble with any source device considered in the study. Dealing with original MJPEG- 
compressed video frames, different camera models adopt different JPEG marker 
segment sequences when building MJPEG-compressed video frames. Moreover, 
content-adaptive quantization tables are generally used in a video sequence, such 
that 2914 unique JPEG luminance quantization tables in this small dataset have been 
found. On these contents, lossless video editing leaves compression settings unal- 
tered, but they introduce very distinctive artifacts in container files’ structure. These 
findings stated that videos from different digital cameras and mobile phones appear 
to employ different container formats and compression. It worth noticing that this 
study is not straightforward since the authors had to write customized file parsers 
to extract the file format information from the videos. On the other side, this diffi- 
culty is also advantageous since we are currently unaware of any publicly available 
software that consistently allows users to forge such information without advanced 
programming skills. This fact makes the creation of realistic forgeries undoubtedly 
a highly non-trivial undertaking and thereby reemphasizes that file characteristics 
and metadata must not be dismissed as unreliable sources of evidence to address file 
authentication. 
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14.3.2 Automated Analysis of mp4-like Videos 


The previous method is subjected to two main limitations: Firstly, being manual 
is time demanding. Secondly, it is prone to error since the forensic practitioner 
have to subjectively assess the finding’s value. In Iuliani Massimo (2019), an auto- 
mated method for unsupervised analysis of video file containers to assess video 
integrity and classify the source device brand. The authors developed an MP4 Parser 
library Apache (2021) that automatically extracted a video container and store it in an 
XML file. The video container is represented as a labeled tree where internal nodes are 
labeled by atoms names (e.g.,moov-2), and leaves are labeled by field-value attributes 
(e.g., @stuff: MovieBox[]). To take into account the order of the atoms, each XML- 
node is identified by a 4-byte code of the corresponding atom along with an index 
that represents the relative position with respect to the other siblings at a certain level. 


Technical Description 


A video file container, extracted from a video X, can be represented as an ordered 
collection of atoms a1, ..., @,, possibly nested. 

We can describe each atom in terms of a set of field-value attributes, as a; = 
loi (ai), <- Om; (a;)). 

By combining the two previous descriptions, the video container can be charac- 
terized by the set of field-value attributes X = (OE we 4 Wm)» each with its associated 
path py (œw), that is the ordered list of atoms to be crossed to reach o in X starting 
from the root. As an example, consider the video X, whose fragments are reported in 
Fig. 14.1. The path to reach the field-value w = @ timescale: 1000 in X is the sequence 
of atoms (ftyp-1, moov-2, mvhd-1). In this sense, we state that px(@) = py’ (o ) if 
the same ordered list of atoms is crossed to reach the field-values in the two trees, 
respectively, which is not the case in the previous example. 

In summary, the video container structure is completely described by a list of m 


field-value attributes X = (w1, eae Om), and their corresponding paths px(@)),..., 
px (Om ) . 

From now on, we will denote with X = {X1, ..., Xy} the world set of digital 
videos, and with C = (Ci, ..., C,} the set of disjoint possible origins, e.g., device 


Huawei P9, iPhone 6s, and so on. 

When a video is processed in any way (with respect to its native form), its integrity 
is compromised SWGDE-SWGIT (2017). 

More generally, given a query video X and a native reference video X’ coming 
from the same supposed device model, their container structure dissimilarities can 
be exploited to expose integrity violation evidence, as follows. 

Given two containers X = (w1, ..., @m), X '= (,, betes wn) with the same car- 
dinality? m, we define a similarity core function between two field-values as 


2 Tf the two containers have a different cardinality, we pad the smaller one with empty field-values 
to obtain the same value m. 


378 A. Piva and M. Iuliani 


L if @ = o; and px(e@i) = py (@;) 


Sloi @;) = | (14.1) 


0 otherwise 


We can easily extend the measure to the comparison between a single field-value 
wi € X and the whole X as 


1 if Jw, € X : S(@;,0,) = 1 


s (14.2) 
0 otherwise 


ly (oi) = | 


Then, the dissimilarity between X and X can be computed as the mismatching 
percentage of all field-values, i.e., 


Y Ly (i) 
mm(X, X) = 1- =)—__—_ (14.3) 
m 


and, to preserve symmetry, the degree of dissimilarity between X and X can be 
computed as 
mm(X, Xx) + mm(X , X) 


D(X, X) = z 


(14.4) 


Based on its definition, D(X, X ) € [0, 1], D(X, X) = 1 when X = X , and D(X, 
X ) = 0, when they have no elements in common. 


Experiments 


The method is tested on 31 portable devices from VISION dataset Dasara Shullani 
et al. (2017) that leads to an available collection of 578 videos in the native format 
plus their corresponding social versions ( YouTube and WhatsApp are considered). 

The metric in Eq. (14.4) is applied to determine whether a video’s integrity is 
preserved or violated. The authors consider four different scenarios of integrity vio- 
lation’: 


e exchange through WhatsApp; 

e exchange through YouTube; 

e cut without re-encoding through FFmpeg; 

e date editing through ExifTool. 

For each of the four cases, we consider the set of videos X),..., X Nc,» avail- 
able for each device C;, and we compute the intra-class dissimilarities between two 
native videos X;, X ; belonging to the same model D; ; = D(X;, X;), Vi # j based 
on Eq. (14.4). For simplicity, we denote with Doo this set of dissimilarities. Then, 
we consider the corresponding inter-class dissimilarities D; j= D(X, X i), Vi = j. 


3 See the paper for the implementation details. 
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Table 14.4 AUC values computed for all devices for each of the four attacks, and when the native 
video is compared to videos from other devices. In the case of ExifTool and Other Devices, the 
achieved min and max AUC values are represented. For all other cases, an AUC equal to 1 is always 
obtained 

Case WhatsApp YouTube FFmpeg ExifTool 


AUC 1.00 1.00 1.00 [0.64 1.00] 


between a native video X; and its corrupted version X ; obtained with the tool-t 
(WhatsApp, YouTube, ffmpeg, or Exiftool). We denote with D‘, this set of dissimi- 
larities. By applying this procedure to all the considered devices, we collected 2890 
samples for both Doo and any of the four D}; 

In most of the performed tests the maximum value achieved for D,, is lower than 
the minimum value of D, thus indicating that the two classes can be separated. 
In Table 14.4 we report the AUCs for each of the considered cases. Noticeably, 
the integrity violation through ffmpeg, YouTube, and WhatsApp, the AUC is 1 for 
all devices. These values clearly show that the exchange through WhatsApp and 
YouTube strongly compromises the container structure, and thus it is easily possible 
to detect the corresponding integrity violation. Interestingly, cutting the video using 
Jfmpeg, without any re-encoding, also results in a substantial container modification 
that we can detect. On the contrary, the date change with Exiftoolinduced on some 
containers a modification that is comparable with the observed intra-variability of 
native videos; in these cases, the method can not detect it. Fourteen devices still 
achieve unitary AUC, eight more devices yield an AUC greater than 0.8, and the 
remaining nine devices yield AUC below 0.8, with the lowest value of 0.64 obtained 
for D02 and D20 (Apple iPhone 4s and iPad mini). 

To summarize, for the video integrity verification tests, the method obtains perfect 
discrimination for videos altered by social networks or ffmpeg, while for Exiftoolit 
achieves an AUC greater than 0.82 on 70% of the considered devices. It worth also 
noticing that analyzing a video query requires less than a second. 

More detailed results are reported in the corresponding paper. The authors also 
show that this container description can be immersed in a likelihood ratio framework 
to determine which atoms and field-values are highly discriminative for specific 
classes. In this way, they also show how to classify the brand of the originating 
device Iuliani Massimo (2019). 


14.3.3 Efficient Video Analysis 


The method described in Iuliani Massimo (2019), although effective, has a linear 
computational cost since it requires checking the dissimilarity of the probe video 
with all available reference containers. As a consequence, increasing the reference 
dataset size leads to a higher computational effort. 
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In this section, we summarize the method proposed in Yang Pengpeng et al. (2020) 
that provides an efficient way to analyze video file containers independently on the 
reference dataset’s size. 

The method, called EVA from now on, is based on Decision Trees Ross Quinlan 
(1986), a non-parametric learning method used for classification problems in many 
signal processing fields. 

Their key feature is breaking down a complex decision-making process into a 
collection of more straightforward decisions. 

The process is extremely efficient since a decision can be taken by checking a 
small number of features. 

Furthermore, EVA allows both to characterize the identified manipulations and to 
provide an explanation for the outcome. 

The method is also enriched with a likelihood ratio framework designed to 
automatically clean up the container elements that only contribute to source intra- 
variability (for the sake of brevity, we do not report these details here). 

In short, EVA can identify the manipulating software (e.g., Adobe Premiere, ffm- 
peg, ...) and provide additional information related to the original content history 
such as the source device operating system. 


Technical Description 


Similarly to the previous case, we consider a video container X as a labeled tree 
where internal nodes and leaves correspond to, respectively, atoms and field-value 
attributes. It can be characterised by the set of symbols {s1, ... Sm}, where s; can be: 
(i) the path from the root to any field (value excluded), also called field-symbols; (ii) 
the path from the root to any field-value (value included), also called value-symbols. 
An example of this representation can bef: 


sı =[ftyp/ @majorBrand] 
s2 =[ftyp/ @majorBrand/isom] 


si= [moov/mvhd/ @duration] 
Si41= [Moov/mvhd/ @duration/73432] 


Overall, we denote with Q the set of all unique symbols s1,..., 5, available 
in the world set of digital video containers ¥ = {X,..., Xn}. Similarly, C = 
{C,,..., Cs} denotes a set of possible origins (e.g., Huawei P9, Apple iPhone 6s). 
Let a container X, the different structure of its symbols {s1, ..., Sm} can be exploited 
to assign the video to a specific class C,. 

For this purpose, binary decision trees Rasoul Safavian and Landgrebe (1991) are 
employed to build a set of hierarchical decisions. In each internal tree node the input 
data is tested against a specific condition; the test outcome is used to select a child 
as the next step in the decision process. Leaf nodes represent decisions taken by the 


4 Note that @ is used to identify atom parameters, and root is used for visualization purposes. 
Still, it is not part of the container data. 
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algorithm. More specifically, EVA adopts the growing-pruning-based Classification 
And Regression Trees (CART) Leo Breiman (2017). 

Given the size of unique symbols |Q| = M, a video container X is converted 
into a vector of integers X — (x; ... xm) where x; is the number of times that s; 
occurs into X. This approach is inspired by the bag-of-words representation Hinrich 
Schütze et al. (2008) which is used to reduce variable-length documents to a fixed- 
length vectorial representation. 

For the sake of example, we consider two classes: 


C,,: 10S devices, native videos; 
C,: 1OS devices, dates modified through Exiftool (see Sect. 14.3.3 for details). 


With these settings, the decision tree highlights that the usage of Exiftool can be 
simply decided by looking, for instance, at the presence of moov/udta/XMP_/ 
@stuff. 


Experiments 


Tests were performed on 34 smartphones of 10 different brands. Over a thousand tam- 
pered videos were tested, considering both automated (ffmpeg, Exiftool) and manual 
editing operations (Kdenlive, Avidemux and Adobe Premiere). The manipulations 
mainly include cut with and without re-encoding, speed up, and slow motion. 

Achieved videos (both native and manipulated) were also exchanged through 
YouTube, Facebook, Weibo, and TikTok. 

All tests were performed by adopting an exhaustive leave-one-out cross-validation 
strategy: the dataset is partitioned in 34 subsets, each one of them containing pristine, 
manipulated, and social-exchanged videos belonging to a specific device. In this 
way, test accuracy collected after each iteration is computed on videos belonging 
to an unseen device. When compared to state of the art, EVA exposes competitive 
effectiveness at the lowest computational effort (see 14.5). 

In comparison with David Giiera et al. (2019), it achieves higher accuracy. We can 
reasonably attribute this performance to their use of a smaller feature space; indeed, 
only a subset of the available pieces of information are extracted without considering 
their position within the video container. On the contrary, EVA features also include 
the path from the root to the value, thus providing a higher discriminating power. 
Indeed, this approach distinguishes between two videos where the same information 


Table 14.5 Comparison of our method with the state of the art. Values of accuracy and time are 
averaged over the 34 folds 


Balanced accuracy Training time Test time 


Guera et al. David 0.67 347s < Ils 
Güera et al. (2019) 


luliani et al. Iuliani 0.85 
Massimo (2019) 


EVA 0.98 


N/A 8s 


31s < ls 


382 A. Piva and M. Iuliani 


is stored in different atoms. When compared with Iuliani Massimo (2019), EVA is 
capable of obtaining better classification performance with a lower computational 
cost. In Iuliani Massimo (2019), O(N) comparisons are required since all the N 
reference-set examples must be compared with a tested video; on the contrary, the 
cost for a decision tree analysis is O(1) since the output is reached in a constant 
number of steps. Furthermore, EVA allows a simple explanation for the outcome. 
For the sake of example, we report in Fig. 14.2a a sample tree from the integrity 
verification experiments: the decision is taken by up to four checks, just based on 
the presence of the symbols ftyp/@minorVersion = 0,uuid/@userType, 
moov/udta/XMP_andmoov/udta/auth. Then, a single decision tree can han- 
dle both easy- and hard-to-classify cases at the same time. 


Manipulation Characterization 
EVA is also capable of identifying the manipulating software and the operating system 
of the originating device. More specifically, three main questions were posed: 


A Software identification: Is the proposed method capable of identifying the soft- 
ware used to manipulate a video? If yes, is it possible to identify the operating 
system of the original video? 

B Integrity Verification on Social Media: Given a video from a social media 
platform (YouTube, Facebook, TikTok or WeiBo), can we determine whether 
the original video was pristine or tampered? 

C Blind scenario: Given a video that may or may not has been exchanged through 
a social media platform, is it possible to retrieve some information on the video 
origin? 


In the Software Identification, the experiment comprises the following classes: 
“native” (136 videos), “Avidemux” (136 videos), “Exiftool” (136 videos), “ffmpeg” 
(680 videos), “Kdenlive” (136 videos), and “Premiere” (136 videos). As shown in 
Table 14.6, EVA obtains a balanced global accuracy of 97.6% with slightly lower 
accuracy in identifying ffmpeg with respect to the other tools. These wrong classifi- 
cations are reasonable because ffmpeg library is used by other software and, internally, 
by Android devices. 

The algorithm maintains a high discriminative power even when trained to identify 
the originating operating system (Android vs. iOS). 

In a few cases, EVA wrongly classifies only the source’s operating system. This 
mistake happens explicitly in the case of Kdenlive and Adobe Premiere. At the same 
time, these programs are always correctly identified. This result indicates that the 
container’s structure of videos saved by Kdenlive and Adobe Premiere is probably 
reconstructed in a software-specific way. More detailed performance is reported in 
the related paper. 

With regard to Integrity Verification on Social Media, EVA was also tested 
on YouTube, Facebook, TikTok, and Weibo videos to determine whether they were 
pristine or manipulated before the upload (Table 14.7). 

Results highlight poor performance due to the social media transcoding process 
that flattens the containers almost independently on the video origin. For example, 
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count(root/ftyp/@minorVersion=0) < 1 
samples = 1320 


ee False 
samples = 924 count(root/uuid/@userType) < 1 
class = Tampered samples = 396 
count(root/moov/udta/XMP_/@stuff) < 1 samples = 211 
samples = 185 class = Tampered 
count(root/moov/udta/auth/@stuff) < 1 samples = 52 
class = Tampered 


samples = 133 


samples = 128 samples = 5 
class = Native class = Native 


(a) Integrity verification classifier. 


count(root/moov/trak/mdia/minf/stbl/stsd/mp4a/esds/@syncExtensionType=0) < 1 
samples = 6600 


False 


samples = 5280 class = Youtube 


eG 


typ/@minorVersion=0) < 1 samples = 132 
samples = 5148 class = premiere 


ount(root/ftyp/@compatibleBrand_1=mp42) < ') samples = ae) 


count(root/moov/udta/XMP /@stuff) < 1 samples = 79 
samples = 185 class = exiftool 


mdia/minf/stbl/stsd/avc1/avcC/@avcLevelIndication=31) < ") eS = a) 


samples = 133 class = exiftool 


(b) Detail of a blind scenario classifier. 


Fig. 14.2 Pictorial representation of some of the generated decision trees 
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Table 14.6 Confusion matrix under the software identification scenario for native contents (Na), 
Avidemux (Av), Exiftool (Ex), ffmpeg (Ff), Kdenlive (Kd), Premiere (Pr) 


Na Av Ex Ff Kd 
Na 0.97 - 0.03 _ _ 
Av _ 1.00 _ _ _ 
Ex 0.01 _ 0.99 _ _ 
Ff _ 0.01 _ 0.90 0.09 _ 
Kd - _ _ _ 1.00 _ 
Pr _ _ _ _ _ 1.00 


Table 14.7 Performance achieved for integrity verification on social media contents. We report for 
each social network the obtained accuracy, true positive rate (TPR), and true negative rate (TNR). 
All these performance measures are balanced 


Accuracy TNR TPR 
Facebook 0.76 0.40 0.86 
TikTok 0.80 0.51 0.75 
Weibo 0.79 0.45 0.82 
YouTube 0.60 0.36 0.74 


after YouTube transcoding, videos produced by Avidemux and by Exiftool are shown 
tohave exactly the same containerrepresentation. We do not know how the considered 
platforms process the videos due to the lack of public documentation, but we can 
assume that uploaded videos undergo custom/multiple processing. Indeed, social 
media videos need to be viewable on a great range of platforms and thus need to be 
transcoded to multiple video codecs and adapted for various resolutions and bitrates. 
Thus, it seems plausible that those operations could discard most of the original 
container structure. 

Eventually, in the Blind Scenario, the authors show that we can remove the prior 
information (whether the video belongs to social media or not) without degrading 
detection performances. 

In summary, EVA can be considered an efficient forensic method for checking 
video Integrity. H manipulation is detected, EVA can also identify the editing software 
and, in most cases, the video source device's operating system. 


14.4 Concluding Remarks 


In this chapter, we described how features extracted from image and video file con- 
tainers can be exploited for integrity verification. In some cases, these features also 
provide insights into the probe’s previous history. In several cases, they allow the 
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characterization of the source device's brand or model. Similarly, they allow deriving 
that the content was exchanged through a particular social network. 

This kind of information shows several advantages with respect to content-based 
traces. Firstly, it is usually much faster to be extracted and analyzed than content- 
based ones. It also requires an almost fixed computational effort, independently on the 
media size and resolution (this characteristic is particularly relevant for videos). Sec- 
ondly, it is also much more challenging to perform anti-forensics operations on these 
features since the structure of an image, or video file is overly complicated. Thus, it 
is not straightforward to mimic a native file container. There are, however, several 
drawbacks to be taken into account. First of all, current editing tools and metadata 
editors cannot access such low-level information. It is required to design a proper 
parser to extract the information stored in the file container. Moreover, these features’ 
effectiveness strongly depends on the availability of updated reference datasets of 
native and manipulated contents. Eventually, these features cannot discriminate the 
full history of a media since some processing operations, like uploading to a social 
network, tend to remove the traces left by previous steps. It is also worth mentioning 
that the forensic analysis of social media content is fragile since the providers’ set- 
tings keep changing. This fact makes collected data and achieved results outdated in 
a short time. Furthermore, collecting large datasets can be challenging depending on 
the social media provider’s upload and download protocol. Future work should be 
devoted to the fusion of content-based and format-based features since it is expected 
that the exploitation of these two domains can provide a better performance, as wit- 
nessed in Phan et al. (2019), where images exchanged on multiple social networks 
were studied. 
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Chapter 15 A) 
Image Provenance Analysis cilo: 


Daniel Moreira, William Theisen, Walter Scheirer, Aparna Bharati, 
Joel Brogan, and Anderson Rocha 


The literature of multimedia forensics is mainly dedicated to the analysis of sin- 
gle assets (such as sole image or video files), aiming at individually assessing their 
authenticity. Different from this, image provenance analysis is devoted to the joint 
examination of multiple assets, intending to ascertain their history of edits, by eval- 
uating pairwise relationships. Each relationship, thus, expresses the probability of 
one asset giving rise to the other, through either global or local operations, such as 
data compression, resizing, color-space modifications, content blurring, and content 
splicing. The principled combination of these relationships unveils the provenance 
of the assets, also constituting an important forensic tool for authenticity verification. 
This chapter introduces the problem of provenance analysis, discussing its impor- 
tance and delving into the state-of-the-art techniques to solve it. 


15.1 The Problem 


Consider a questioned media asset, namely a query (such as a digital image whose 
authenticity is suspect), and a large corpus of media assets (such as the Internet). 
Provenance analysis comprises the problem of (i) finding, within the available corpus, 
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(a) (b) (c) 


Fig. 15.1 Images that became viral on the web in the last decade, all with unknown sources. 
In a, the claimed world’s first dab, supposedly captured during WWII. The dabbing soldier was 
highlighted, to make him more noticeable. In b, the claimed proof of an unlikely friendship between 
the Notorious B.I.G., on the left, and Kurt Cobain, on the right. In c, a photo of a supposed NATO 
meeting aimed at supporting a particular political narrative 


Fig. 15.2 A reverse image search of Fig. 15.1a leads to the retrieval of these two images, among 
others. Figure 15.1a is probably the result of cropping (a), which in turn is a color transformation 
of (b). The dabbing soldier was highlighted in a, to make him more noticeable. Image (b) is, thus, 
the source, comprising a behind-the-scenes picture from Dunkirk (2017) 


the assets that directly and transitively share content with the query, as well as of (ii) 
establishing the derivation and content-donation processes that explain the existence 
of the query. Take, for example, the three queries depicted in Fig. 15.1, which became 
viral images in the last decade. Reasons for their virality range from the popularity of 
harmless pop-culture jokes and historical oddities (such as in the case of Fig. 15.1a, 
b) to interest in more critical political narratives and agendas (such as in Fig. 15.1c). 
Provenance analysis offers a principled and automated framework to debunk such 
media types by retrieving and associating other assets that help to elucidate their 
authenticity. 

To get a glimpse of the expected outcome of performing provenance analysis, 
one could manually submit each one of the three queries depicted in Fig. 15.1 to a 
reverse image search engine, such as TinEye (2021), and try to select and associate the 
retrieved images to the queries by hand. This process is illustrated through Figs. 15.2, 
15.3, and 15.4. Based on Fig. 15.2, for instance, one can figure out that the claimed 
world’s first dab, supposedly captured during WWIL, is a crop of Fig. 15.2a, which 
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(b) 


Fig. 15.3 A reverse image search of Fig. 15.1b leads to the retrieval of these two images, among 
others. Figure 15.1b is probably a composition of (a) and (b), where a donates the background, 
while b donates the Notorious B.I.G. on his car’s seat. Cropping and color corrections are also 
performed to complete the forgery 


(b) 


Fig. 15.4 A typical reverse image search of Fig. 15.1c leads to the retrieval of a but not b. Image 
b was obtained through one of the content retrieval techniques presented in Sect. 15.2 (context 
incorporation). Figure 15.1c is, thus, a composition of a and b, where a donates the background 
and b donates Vladimir Putin. Cropping and color corrections are also performed to complete the 
forgery 


in turn is a color modified version of Fig. 15.2b, a well-known picture of the cast of 
the Hollywood movie Dunkirk (2017). 

Figure 15.3, in turn, helps to reveal that Fig. 15.1b is actually a forgery (com- 
position), where Fig. 15.3a probably serves as the background for the splicing of 
the Notorious B.I.G. on his car’s seat, taken from Fig. 15.3a. In a similar fashion, 
Fig. 15.4 points out that Fig. 15.1c is also a composition, this time using Fig. 15.4a 
as background, and Fig. 15.4b as the donor of the portrayed individual. In this par- 
ticular case, Fig. 15.4b is not easily found by a typical reverse image search. To do 
so, we had to perform context incorporation, a content retrieval strategy adapted to 
provenance analysis that is explained in Sect. 15.2. 

In the era of misinformation and “fake news”, there is a symptomatic crisis of trust 
in the media assets shared online. People are aware of editing software, with which 
even unskilled users can quickly fabricate and manipulate content. Although many of 
these manipulations have benign purposes (no, there is nothing wrong with the silly 
memes you share), some content is generated with malicious intent (general public 
deception and propaganda), and some modifications may undermine the ownership 
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of the media assets. Under this scenario, provenance analysis reveals itself as a 
convenient tool to expose the provenance of the assets, aiding in the verification of 
their authenticity, protecting their ownership, and restoring credibility. 

In the face of the massive amount of data produced and shared online, though, 
there is no space for performing provenance analysis manually, such as formerly 
described. Besides the need for particular adaptations such as context incorporation 
(see Sect. 15.2), provenance analysis must also be performed at scale automati- 
cally and efficiently. By combining ideas from image processing, computer vision, 
graph theory, and multimedia forensics, provenance analysis constitutes an interest- 
ing interdisciplinary endeavor, into which we delve into in detail in the following 
sections. 


15.1.1 The Provenance Framework 


Provenance analysis can be executed at scale and fully automated by following a basic 
framework that involves two stages. Such a framework is depicted in Fig. 15.5. As 
one might observe, the first stage is always related to the activity of content retrieval, 
which incorporates a questioned media asset (a.k.a. the query) and a corpus of media 
assets to retrieve a selection of assets of interest that are related to the query. 

Figure 15.6 depicts the expected outcome of content retrieval for Fig. 15.1b as the 
query. In the case of provenance analysis, the content retrieval activity must retrieve 
not only the objects that directly share content with the query but also transitively. 
Take, for example, images 1, 2, and 3 within Fig. 15.6, which all share some visual 
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Fig. 15.5 The provenance framework. Provenance analysis is usually executed in two stages. 
Starting from a given query and a corpus of media assets, the first stage is always related to the 
content retrieval activity, which is herein explained in Sect. 15.2. The first stage’s output is a list 
of assets of interest (content list), which is fed to the second stage. The second stage, in turn, may 
either comprise the activity of graph construction (discussed within Sect. 15.3) or the activity of 
content clustering (discussed within Sect. 15.4). While the former activity aims at organizing the 
retrieved assets in a provenance graph, the latter focuses on establishing meaningful asset clusters 
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Fig. 15.6 Content retrieval example for a given query. The desired output of content retrieval for 
provenance analysis is a list of media assets that share content with the query, either directly (such as 
objects 1, 2, and 3) or transitively (object 4, through object 1). Methods to perform content retrieval 
are discussed in Sect. 15.2 


elements with the query. Image 4, however, has nothing in common with the query. 
But its retrieval is still desirable because it shares visual content with image 1 (the 
head of Tupac Shakur, on the right), hence it is related to the query transitively. 
Techniques to perform content retrieval for the provenance analysis of images are 
presented and discussed in Sect. 15.2. 

Once a list of related media assets is available, the provenance framework’s typical 
execution moves forward to the second stage, which has two alternate modes. The 
first, provenance graph construction, aims at computing the directed acyclic graph 
whose nodes individually represent the query and the related media assets and whose 
edges express the edit and content-donation story (e.g., cropping, blurring, splicing, 
etc.) between pairs of assets, linking seminal to derived elements. It, thus, embodies 
the provenance of the objects it contains. Figure 15.7 provides an example of prove- 
nance graph, constructed for the media assets depicted in Fig. 15.6. As shown, the 
query is probably a crop of image 1, which is a composition. It uses image 2 as back- 
ground and splicing objects from images 3 and 4 to complete the forgery. Methods 
to construct provenance graphs from sets of images are presented in Sect. 15.3. 

The provenance framework may be changed by replacing the second stage of 
graph construction with a content clustering approach. This setup is sound in the 
study of contemporary communication on the Internet, such as exchanging memes 
and the reproduction of viral movements. Dabbing (see Fig. 15.1a), for instance, is an 
example of this phenomenon. In these cases, the users’ intent is not limited to retriev- 
ing near-duplicate variants or compositions that make use of a given query. They are 
also interested in the retrieval and grouping of semantically similar objects, which 
may greatly vary in appearance to elucidate the provenance of a trend. This situation 
is represented in Fig. 15.8. Since graph construction may not be the best approach to 
organize semantically similar media assets obtained during content retrieval, content 
clustering reveals itself as an interesting option, as we discuss in Sect. 15.4. 
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Fig. 15.7 Provenance graph example. This graph is constructed using the media assets retrieved 
in the example given through Fig. 15.6. In a nutshell, it expresses that the query is probably a crop 
of image 1, which in turn is a composition forged by combining the contents of images 2, 3, and 4. 
Methods to perform provenance graph construction are presented in Sect. 15.3 


1 2 3 4 


Fig. 15.8 Content retrieval of semantically similar assets. Some applications might aim at identify- 
ing particular behaviors or gestures that are depicted in a meme-style viral query. That might be the 
case for someone trying to understand the trend of dabbing, which is reproduced in all of the images 
above. In such a case, provenance content retrieval might still be helpful to fetch related objects. 
Graph construction, however, may be rendered useless due to the dominant presence of semantically 
similar objects rather than near-duplicates or compositions. In such a situation, we propose using 
content clustering as the second stage of the provenance framework, which we discuss in Sect. 15.4 


15.1.2 Previous Work 


Early Work: De Rosa et al. (2010) have mentioned the importance of considering 
groups of images instead of single images while performing media forensics. Start- 
ing with a set of images of interest, they have proposed to express pairwise image 
dependencies through the analysis of the mutual information between every pair of 
images. By combining these dependencies into a single correlation adjacency matrix 
for the entire image set, they have suggested the generation of a dependency graph, 


15 Image Provenance Analysis 395 


whose edges should be individually evaluated as being in accordance with a partic- 
ular set of image transformation assumptions. These assumptions should presume 
a known set of operations, such as rotation, scaling, and JPEG compression. Edges 
unfit to these assumptions should be removed. 

Similarly, Kennedy and Chang (2008) have explored solutions to uncover the 
processes through which near-duplicate images have been copied or manipulated. 
By relying on the detection of a closed set of image manipulation operations (such 
as cropping, text overlaying, and color changing), they have proposed a system 
to construct visual migration maps: graph data structures devised to express the 
parent-child derivation operations between pairs of images, being equivalent to the 
dependency graphs proposed in De Rosa et al. (2010). 


Image Phylogeny Trees: Rather than modeling an exhaustive set of possible oper- 
ations between near-duplicate images, Dias et al. (2012) designed and adopted a 
robust image similarity function to compute a pairwise image similarity matrix M. 
For that, they have introduced an image similarity calculation protocol to generate 
M, which is widely used across the literature, including provenance analysis. This 
method is detailed in Sect. 15.3. To obtain a meaningful image phylogeny tree from 
M, to represent the evolution of the near-duplicate images of interest, they have intro- 
duced oriented Kruskal, a variation of Kruskal’s algorithm that extracts an oriented 
optimum spanning tree from M. As expected, phylogeny trees are analogous to the 
aforementioned dependency graphs and visual migration maps. 

In subsequent work, Dias et al. (2013) have reported a large set of experiments with 
their methodology in the face of a family of six possible image operations, namely 
scaling, warping, cropping, brightness and contrast changes, and lossy compression. 
Moreover, Dias et al. (2013) have also explored the replacement of oriented Kruskal 
with other phylogeny tree-building methods, such as oriented Prim and Edmond’s 
optimum branching. 

Melloni et al. (2014) have contributed to the topic of image phylogeny tree 
reconstruction by investigating ways to combine different image similarity metrics. 
Bestagini et al. (2016) have focused on the clues left by local image operations (such 
as object splicing, object removal, and logo insertion) to reconstruct the phylogeny 
trees of near-duplicates. More recently, Zhu and Shen (2019) have proposed heuris- 
tics to improve phylogeny trees by correcting local image inheritance relationship 
edges. Castelletto et al. (2020), in turn, have advanced the state of the art by training a 
denoising convolutional autoencoder that takes an image similarity adjacency matrix 
as input and returns an optimum spanning tree as the desired output. 


Image Phylogeny Forests: All the techniques mentioned so far were conceived to 
handle near-duplicates. Aiming at providing a solution to deal with semantically 
similar images, Dias et al. (2013) have extended the oriented Kruskal method pro- 
posed in Dias et al. (2012) to what they named the automatic oriented Kruskal. This 
technique is an algorithm to compute a family of disjoint phylogeny trees (hence a 
phylogeny forest) from a given set of near-duplicate and semantically similar images, 
such that each disjoint tree describes the relationships of a particular group of near- 
duplicates. 
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In the same direction, Costa et al. (2014) have provided two extensions to the 
optimum branching algorithm proposed in Dias et al. (2013), namely automatic 
optimum branching and extended automatic optimum branching. Both solutions are 
based on the automatic calculation of branching cut-off execution points. Alter- 
natively, Oikawa et al. (2015) have proposed the use of clustering techniques for 
finding the various disjoint phylogeny trees. Images coming from the same source 
(near-duplicates) should be placed in the same cluster, while semantically similar 
images should be placed in different clusters. 

Milani et al. (2016), in turn, have suggested relying on the estimation of the 
geometric localization of captured viewpoints within the images as a manner to dis- 
tinguish between near-duplicates (which should share viewpoints) from semantically 
similar objects (which should present different viewpoints). Lastly, Costa et al. (2017) 
have introduced solutions to improve the creation of the pairwise image similarity 
matrices, even in the presence of semantically similar images and regardless of the 
graph algorithm used to construct the phylogeny trees. 


Multiple Parenting: Previously mentioned phylogeny work did not address the crit- 
ical scenario of image compositions, in which objects from one image are spliced 
into another. Aiming at dealing with these cases, de Oliveira et al. (2015) have 
modeled every composition as the outcome of two parents (one donor, which pro- 
vides the spliced object, and one host, which provides the composition background). 
Extended automatic optimum branching, proposed in Costa et al. (2014), should 
then be applied for the reconstruction of ideally three phylogeny trees: one for the 
near-duplicates of the donor, one for the near-duplicates of the host, and one for the 
near-duplicates of the composite. 


Other Types of Media: Besides processing still images, some works in the literature 
have addressed the phylogeny reconstruction of assets belonging to other types of 
media, such as video (see Dias et al. 201 1; Lameri et al. 2014; Costa et al. 2015, 2016; 
Milani et al. 2017) and even audio (see Verde et al. 2017). In particular, Oikawa et 
al. (2016) have investigated the role of similarity computation between digital objects 
(such as images, video, and audio) in multimedia phylogeny. 


Provenance Analysis: The herein-mentioned literature of media phylogeny has 
made use of diverse metrics, individually focused on retrieving either the root, the 
leaves, or the ancestors of a node within a reconstructed phylogeny tree, evaluating 
the tree as a whole. Moreover, the datasets used in the experiments presented dif- 
ferent types of limitations, such as either containing only images in JPEG format or 
lacking compositions with more than two sources (cf. object 1 depicted in Fig. 15.7). 

Aware of these limitations and aiming to foster more research in the topic of 
multi-asset forensic analysis, the American Defense Advanced Research Projects 
Agency (DARPA) and the National Institute of Standards and Technology (NIST) 
have joined forces to introduce new terminology, metrics, and datasets (all herein 
presented, in the following sections), within the context of the Media Forensics 
(MediFor) project (see Turek 2021). Therefore, they coined the term Provenance 
Analysis to express a broader notion of phylogeny reconstruction, in the sense of 
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including not only the task of reconstructing the derivation stories of the assets of 
interest but also the fundamental step of retrieving these assets (as we discuss in 
Sect. 15.2). Furthermore, DARPA and NIST have suggested the use of directed 
acyclic graphs (a.k.a. provenance graphs), instead of groups of trees, to represent 
the derivation story of the assets better. 


15.2 Content Retrieval 


With the amount of content on the Internet being so vast, performing almost any 
type of computationally expensive analysis across the web’s entirety or even smaller 
subsets for that matter is simply intractable. Therefore, when setting out to tackle the 
task of provenance analysis, a solution must start with an algorithm for retrieving a 
reasonably-sized subset of relevant data from a larger corpus of media. With this in 
mind, effective strategies for content retrieval become an integral module within the 
provenance analysis framework. 

This section will focus specifically on image retrieval algorithms that provide 
results contingent on one or multiple images as queries into the system. This image 
retrieval is commonly known as reverse image search or more technically Content- 
Based Image Retrieval (CBIR). For provenance analysis, an appropriate CBIR algo- 
rithm should produce a corpus subset that contains a rich collection of images with 
relationships relevant to the provenance of the query image. 


15.2.1 Approaches 


Typical CBIR: Typical CBIR solutions employ multi-level representations of the 
processed images to reduce the semantic gap between the image pixel values (in 
the low level) and the system user’s retrieval intent (in the highlevel, see Liu et al. 
2007). Having provenance analysis in mind, the primary intent is to trace connections 
between images that mutually share visual content. For instance, when performing 
the query from Fig. 15.6, one would want a system to retrieve images that contain 
identical or near-identical structural elements (such as the corresponding images, 
respectively, depicting the Notorious B.I.G. and Kurt Cobain, both used to generate 
the composite query, but not other images of unrelated people sitting in cars). Con- 
sidering that, we can describe an initial concrete example of a CBIR system that is 
not entirely suited to provenance analysis and further gradually provide methods to 
make it more suitable to the problem at hand. 

A typical and moderately useful CBIR system for provenance analysis may rely 
on local features (a.k.a. keypoints) to obtain a low-level representation of the image 
content. Hence, it may consist of four steps, namely: (1) feature extraction, (2) 
feature compression, (3) feature retrieval, and (4) result ranking. Feature extraction 
comprises the task of computing thousands of n-dimensional representations for 
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(a) (b) 


Fig. 15.9 SURF keypoints extracted from a given image. Each yellow circle represents a keypoint 
location, whose pixel values generate an n-dimensional feature vector. In a, a regular extraction 
of SURF keypoints, as defined by Bay et al. (2008). In b, a modified keypoint extraction dubbed 
distributed keypoints, proposed by Moreira et al. (2018), whose intention is also to describe homoge- 
neous regions, such as the wrist skin and clapboard in the background. Images adapted from Daniel 
Moreira et al. (2018) 


each image, ranging from classical handcrafted technologies, such as Scale Invariant 
Feature Transform (SIFT, see Lowe 2004) and Speeded-up Robust Features (SURF, 
see Bay et al. 2008), to neural network learned methods, such as Learned Invariant 
Feature Transform (LIFT, see Yi et al. 2016) and Deep Local Features (DELF, see 
Noh et al. 2017). Figure 15.9a depicts an example of SURF keypoints extracted from 
a target image. 

Although the feature-based representation of images drastically reduces the 
amount of space needed to store their content, there is still a necessity for reduc- 
ing their size, a task performed during the second CBIR step of feature compression. 
Take, for example, a CBIR system that extracts 1,000 64-dimensional floating-point 
SURF features from each image. Using 4 bytes for each dimension, each image 
occupies a total of 4 x 64 x 1000 = 256,000 bytes (or 256kB) of memory space. 
While that may seem relatively low from a single-image standpoint, consider an 
image database containing ten million images (which is far from being an unrealis- 
tic number, considering the scale of the Internet). This would mean the need for an 
image index on the order of 25 terabytes. Instead, we recommend utilizing Optimized 
Product Quantization (OPQ, see Ge et al. 2013) to reduce the size of each feature 
vector. In summary, OPQ learns grids of bins arranged along different axes within the 
feature vector space. These bin grids are rotated and scaled within the feature vector 
space to distribute optimally feature vectors extracted from an example training set. 
This provides a significant reduction in the number of bits required to describe a 
vector while keeping relatively high fidelity. For the sake of illustration, a simplified 
two-dimensional example of OPQ bins is provided in Fig. 15.10. 

The third CBIR step (feature retrieval) aims at using the local features to index 
and compare, within the optimized n-dimensional space they constitute, and through 
Euclidean distance or similar method, pairs of image localities. Inverted File Indices 
(IVF, see Baeza-Yates and Ribeiro-Neto 1999) are the index structure commonly 
used in the feature retrieval step. Utilizing a feature vector binning scheme, such 
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Fig. 15.10 A simplified 2D representation of OPQ. A set of points in the feature space a is re- 
described by rotating and translating grids of m bins to high-density areas (b), in red. The grid 
intervals and axis are learned separately for each dimension. Each dimension of a feature point can 
then be described using only m bits 


as that of OPQ, an index is generated such that it stores a list of features contained 
within each bin. Each stored feature, in turn, points back to the image it was extracted 
from (hence the inverted terminology). The only calculation required is to determine 
the query feature’s respective bin within the IVF to retrieve the nearest neighbors to 
a given query feature. Once this bin is known, the IVF can return the feature vectors’ 
list in that bin and the nearest surrounding neighbor bins. Given that each feature 
vector points back to its source image, one can trace back the database images that 
are similar to the query. This simple yet powerful method provides easy scalability 
and distributability to index search. This is the main storage and retrieval structure 
used within powerful state-of-the-art search frameworks, such as the open-source 
Facebook Artificial Intelligence Similarity Search (FAISS) library introduced by 
Jeff Johnson et al. (2019). 

Lastly, the result ranking step takes care of polling the feature-wise most similar 
database images to the query. The simplest metric with which one can rank the 
relatedness of the query image with the other images is feature voting (see Pinto et 
al. 2017). To perform it, one must iterate through each query feature and its retrieved 
nearest neighbor features. Then, by checking which database image each returned 
feature belongs to, one must accumulate a tally of how many features are matched 
to the query for each image. This final tally is then utilized as the votes for each 
database image, and these images are ranked (or ordered) accordingly. An example 
of this method can be seen in Fig. 15.11. 

With these four steps implemented, one already has access to a rudimentary image 
search system capable of retrieving both near-duplicates and semantically similar 
images. While operational, this system is not particularly powerful when tasked 
with finding images with nuanced inter-image relationships relevant to provenance 
analysis. In particular, images that share only small amounts of content with the query 
image will not receive many votes in the matching process and may not be retrieved 
as a relevant result. Instead, this system can be utilized as a foundation for a retrieval 
system more fine-tuned to the task of image provenance analysis. In this regard, four 
methods to improve a typical CBIR solution are discussed in the following. 
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Fig. 15.11 A 2D simplified representation of IVF and feature voting. Local features are extracted 
from three images, representing a small gallery. The feature vector space is partitioned into a 
grid, and gallery features are indexed within this grid. Local features are then extracted from the 
query image of Reddit Superman (whose chest symbol and underwear are modified versions of the 
yawning cat’s mouth depicted within the first item of the gallery). Each local query feature is fed 
to the index, and nearest neighbors are retrieved. Colored lines represent how each local feature 
matches the IVF features and subsequently map back to individual gallery features. We see how 
these feature matches can be used in a voting scheme to rank image similarities in the bottom left 


Distributed Keypoints: To take the best advantage of these concepts, we must 
ensure that the extracted keypoints for the local features used within the indexing and 
retrieval pipeline do a good job describing the entire image. Most keypoint detectors 
return points that lie on corners, edges, or other high-entropy patches containing 
visual information. This, unfortunately, means that some areas within an image may 
be given too many vital points and may be over-described, while other areas may be 
entirely left out, receiving no keypoints at all. An area with no keypoints and, thus, 
no representation within the database image index has no chance of being correctly 
retrieved if a query with similar features comes along. 

To mitigate over-description and under-description of image areas, Moreira et 
al. (2018) proposed an additional step in the keypoint detection process, which is 
the avoidance and removal of keypoints that present too much overlap with others. 
Such elements are then replaced with keypoints coming from weaker-entropy image 
regions, allowing for a more distributed content description. Figure 15.9b depicts an 
example of applying this strategy, in comparison with the regular SURF extraction 
approach. 


Context Incorporation: One of the main reasons a typical CBIR solution performs 
poorly in an image provenance scenario is the nature behind many image manip- 
ulations within provenance cases. These manipulations often consist of composite 
images with small objects coming from donor images. An example of these types of 
relationships is shown in Fig. 15.12. 
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Fig. 15.12 An example of composite from the r/photoshopbattles subreddit. The relations of the 
composite with the images donating small objects, such as the hummingbird and the squirrel, 
challenge the retrieval capabilities of a typical CBIR solution. Context incorporation comes in 
handy in these situations. Image adapted from Joel Brogan et al. (2021) 


These types of image relationships do not lend themselves to a naive feature voting 
strategy, as the size of the donated objects is often too small to garner enough local 
feature matches to impart a high vote score with the query image. We can augment 
the typical CBIR pipeline with an additional step, namely context incorporation, 
to solve this problem. Context incorporation takes into consideration the top N 
retrieved images for a given query, to accurately localize areas within the query and 
the retrieved images that differ from each other (most likely due to a composite or 
manipulation), to generate attention masks, and to re-extract features over only the 
distinct regions of the query, for a second retrieval execution. By using only these 
features, the idea is that the additional retrieval and voting steps will be driven towards 
finding the images that have donated small regions to the query due to the absence 
of distractors. 

Figure 15.13 depicts an example where context incorporation is crucial to obtain 
the image that has donated Vladimir Putin (Fig. 15.13d) to a questioned query 
(Fig. 15.13a), in the first positions of the result rank. Different approaches for per- 
forming context incorporation, including attention mask generation, were proposed 
and benchmarked by Brogan et al. (2017), while an end-to-end CBIR pipeline 
employing such a strategy was discussed by Pinto et al. (2017). 


Iterative Filtering: Another requirement of content retrieval within provenance anal- 
ysis is the recovery of images directly related to the query and transitively related to 
it. That is the case of image 4 depicted within Fig. 15.6, which does not share content 
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Fig. 15.13 Context incorporation example. In a, the query image. In b, the most similar retrieved 
near-duplicate (top 1 image) through a typical CBIR solution. In c, the attention mask highlighting 
the different areas between a and b, after proper content registration. In d, the top | retrieved image 
after using c as anew query (a.k.a., context incorporation). The final result rank may be acombination 
of the two ranks after using a and c as a query, respectively. Image d is only properly retrieved, 
from the standpoint of provenance analysis, thanks to the execution of context incorporation 


directly with the query, but is related to it through image 1. Indeed, any other near- 
duplicates of image 4 should ideally be retrieved by a flawless provenance-aware 
CBIR solution. Aiming at also retrieving transitively related content to the query, 
Moreira et al. (2018) introduced iterative filtering. After retrieving the first rank of 
images, the results are iteratively refined by suppressing near-duplicates of the query 
and promoting non-near-duplicates as new queries to the next retrieval iteration. 
This process is executed a number of times, leading to a set of image ranks for each 
iteration, which are then combined into a single one, at the end of the process. 


Object-Level Retrieval: Despite the modifications above to improve typical CBIR 
solutions towards provenance analysis, local-feature-based retrieval systems do not 
inherently incorporate structural aspects into the description and matching process. 
For instance, any retrieved image with a high match vote score could still, in fact, 
be completely dissimilar to the query image. That happens because the matching 
process does not take into account the position of local features with respect to each 
other within an image. As a consequence, unwanted database images that contain 
features individually similar to the query’s features, but in a different pattern, may 
still be ranked highly. 
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Fig.15.14 An example of how OS2OS matching is accomplished. In a, a query image has the local 
feature keypoints mapped relative to a computed centroid location. In b, matching features from a 
database image are projected to an estimated centroid location using the vectors shown in a. In c, 
the subsequent vote accumulation matrix is created. In d, the accumulation matrix is overlayed on 
the database image. The red area containing many votes shows that the query and database images 
most likely share an object in that location 


To avoid this behavior, Brogan et al. (2021) modified the retrieval pipeline to 
account for the structural layout of local features. To do so, they proposed a method 
to leverage the scale and orientation components calculated as part of the SURF key- 
point extraction mechanism and to perform a transform estimation similar to the gen- 
eralized Hough voting (see Ballard 1981), relative to a computed centroid. Because 
each feature’s scale and orientation are known, for both the query’s and database 
images’ features, database features can be projected to the query keypoint layout 
space. Areas that contain similar structures accumulate votes in a given location on 
a Hough accumulator matrix. By clustering highly active accumulation areas, one 
1s able to quickly determine local areas shared between images that have structural 
consistency with each other. A novel ranking algorithm then leverages these areas 
within the accumulator to subsequently score the relationship between images. This 
entire process is called “objects in scenes to objects in scenes” (OS2OS) matching. 
An example of how the voting accumulation works is depicted in Fig. 15.14. 
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15.2.2 Datasets and Evaluation 


The literature for provenance analysis has reported results of content retrieval over 
two major datasets, namely the Nimble Challenge (NC17, 2017) and the Media 
Forensics Challenge (MFC18, 2018) datasets. 

NC17 (2017): On the occasion of promoting the Nimble Challenge 2017, NIST 
released an image dataset containing a development partition (Dev 1-Beta4) specifi- 
cally curated to support both research tasks for provenance analysis. Namely, prove- 
nance content retrieval (named provenance filtering by the agency) and provenance 
graph construction. This partition contains 65 image queries and 11,040 images either 
related to the queries through provenance graphs, or completely unrelated material 
(named distractors). The provenance graphs were manually created by image edition 
experts and include operations such as splicing, removal, cropping, scaling, blurring, 
and color transformations. As a consequence, the partition offers content retrieval 
ground truth composed of 65 expected image ranks. Aiming to increase the chal- 
lenge offered by this partition, Moreira et al. (2018) extended the set of distractors 
by adding one million unrelated images, which were randomly sampled from Eval- 
Ver1, another partition released by NIST as part of the 2017 challenge. We rely on 
this configuration to provide some results of content retrieval and explain how the 
different provenance CBIR add-ons explained in Sect. 15.2.1 contribute to solve the 
problem at hand. 

MFC18 (2018): Similar to the 2017 challenge, NIST released another image 
dataset in 2018, with a partition (Eval-Ver1-Part1) also useful for provenance content 
retrieval. Used to officially evaluate the participants of the MediFor program (see 
Turek 2021), this set contains 3,300 query images and over one million images, 
including content related to the queries and distractors. Many of these queries are 
composites, with the expected content retrieval image ranks provided as ground truth. 
Moreover, this dataset also provides ground-truth annotations as to whether a related 
image contributes only a particular small object to the query (such as in the case of 
image 4 donating Tupac Shakur’s head to image 1, within Fig. 15.7), instead of an 
entire large background. These cases are particularly helpful to assess the advantages 
of using the object-level retrieval approach presented in Sect. 15.2.1 in comparison 
to the other methods. 

As suggested in the protocol introduced by NIST (2017), the metric used to 
evaluate the performance of a provenance content retrieval solution is the CBIR 
recall of the images belonging to the ground truth rank, at three specific cut-off 
points. Namely, (i) R @50 (i.e., the percentage of ground-truth expected images that 
are retrieved among the top 50 assets returned by the content retrieval solution), (ii) 
R@100 (ie., the percentage of ground truth images retrieved among the top 100 
assets returned by the solution), and (iii) R @200 (i.e., the percentage of ground truth 
images retrieved among the top 200 assets returned by the solution). Since the recall 
expresses the percentage of relevant images being effectively retrieved, the method 
delivering higher recall is considered better. 
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In the following section, results in terms of a recall are reported for the different 
solutions presented in Sect. 15.2.1, over the aforementioned datasets. 


15.2.3 Results 


Table 15.1 summarizes the results of provenance content retrieval reported by Moreira 
et al. (2018) over the NC17 dataset. It helps to put into perspective some of the 
techniques detailed in Sect. 15.2.1. As one might observe, by comparing rows 1 and 
2 of Table 15.1, the simple modification of using more keypoints (from 2,000 to 5,000 
features) to describe the images within the CBIR base module already provides a 
significant improvement on the system recall. Distributed Keypoints, in turn, improve 
the recall of larger ranks (for R@100 and R@200), while Iterative Filtering alone 
allows the system to reach an impressive recall of 90% of the expected images among 
the top 50 retrieved ones. At the time of their publication, Moreira et al. (2018) found 
that a combination of Distributed Keypoints and Iterative Filtering led to the best 
content retrieval solution, represented by the last row of Table 15.1. 

More recently, Brogan et al. (2021) performed new experiments on the MFC18 
dataset, this time aiming at evaluating the performance of their proposed OS20S 
approach. Table 15.2 compares the results of the best solution previously identified 
by Moreira et al. (2018), in row 1, with the addition of OS2OS, in row 2, and 


Table 15.1 Results of provenance content retrieval over the NC17 dataset. Reported here are the 
average recall values of 65 queries at the top 50 (R @50), top 100 (R @ 100), and top 200 (R @200) 
retrieved images. Provenance add-ons on top of the CBIR base module were presented in Sect. 15.2.1 


CBIR Base Provenance | R@50 R@100 R@200 Source 
Add-ons 
2,000 SURF features, Context 71% 72% 74% Daniel 
OPQ Incorpora- Moreira 
tion et al. (2018) 
5.000 SURF features, Context 88% 88% 88% Daniel 
OPQ Incorpora- Moreira 
tion et al. (2018) 
5,000 SURF features, Distributed | 88% 90% 90% Daniel 
OPQ Keypoints Moreira 
et al. (2018) 
5,000 SURF features, Iterative 90% 90% 92% Daniel 
OPQ Filtering Moreira 
et al. (2018) 
5,000 SURF features, Distrib. Key- | 91% 91% 92% Daniel 
OPQ points, Moreira 
Iterative et al. (2018) 
Filtering 
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Table 15.2 Results of provenance content retrieval over the MFC18 dataset. Reported here are 
the average recall values of 3,300 queries at the top 50 (R@50), top 100 (R@100), and top 200 
(R @200) retrieved images. Provenance add-ons on top of the CBIR base module were presented 
in Sect. 15.2.1. OS2OS stands for “objects in scene to objects in scene”, previously presented as 
object-level retrieval 


CBIR Base Provenance | R@50 R@100 R@200 Source 
Add-ons 
5,000 SURF features, Distrib. Key- | 77% 81% 82% Joel Brogan 
OPQ points, et al. (2021) 
Iterative 
Filtering 
5,000 SURF features, Distrib. Key- | 83% 83% 84% Joel Brogan 
OPQ points, et al. (2021) 
OS20S 
1,000 DELF features, Iterative 87% 90% 91% Joel Brogan 
OPQ Filtering et al. (2021) 
1,000 DELF features, OS20S 91% 93% 95% Joel Brogan 
OPQ et al. (2021) 


replacement of the SURF features with DELF (see Noh et al. 2017) features, in rows 
3 and 4, over the MFC18 dataset. OS2OS alone (compare rows 1 and 2) improves the 
system recall for all the three rank cut-off points. Also, the usage of only 1,000 DELF 
features (a data-driven image description approach that relies on learned attention 
models, in contrast to the 5,000 handcrafted SURF features) significantly improves 
the system recall (compare rows | and 3). In the end, the combination of a DELF- 
based CBIR system and OS2OS leads to the best content retrieval approach, whose 
recall values are shown in the last row of Table 15.2. 

Lastly, as explained before, the MFC18 dataset offers a unique opportunity to 
understand how the content retrieval solutions work for the recovery of dataset images 
that eventually donated small objects to the queries at hand since it contains specific 
annotations in this regard. Table 15.3 summarizes the results obtained by Brogan et 
al. (2021) when evaluating this aspect. As expected, the usage of OS2OS greatly 
improves the recall of small-content donors, as it can be seen through rows 2 and 4 
of Table 15.3. Overall, the best content retrieval approach (a combination of DELF 
features and OS2OS) is able to retrieve only 55% of the expected small-content 
donors among the top 200 retrieved assets (see the last row of Table 15.3). This result 
indicates that more work still needs to be done in this specific aspect: improving the 
provenance content retrieval of small-content donor images. 
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Table 15.3 Results of provenance content retrieval over the MFC18 dataset, with focus on the 
retrieval of images that donated small objects to the queries. Reported here are the average recall 
values of 3,300 queries at the top 50 (R@50), top 100 (R@100), and top 200 (R @200) retrieved 
images. Provenance add-ons on top of the CBIR base module were presented in Sect. 15.2.1. OS20S 
stands for “objects in scene to objects in scene”, previously presented as object-level retrieval 


CBIR Base Provenance | R@50 R@100 R@200 Source 
Add-ons 
5,000 SURF features, Distrib. Key- | 28% 34% 42% Joel Brogan 
OPQ points, et al. (2021) 
Iterative 
Filtering 
5,000 SURF features, Distrib. Key- | 45% 48% 52% Joel Brogan 
OPQ points, et al. (2021) 
OS20S 
1,000 DELF features; Iterative 41% 45% 49% Joel Brogan 
OPQ Filtering et al. (2021) 
1,000 DELF features, OS20S 51% 54% 55% Joel Brogan 
OPQ et al. (2021) 


15.3 Graph Construction 


A provenance graph depicts the story of edits and manipulations underwent by a 
media asset. This section focuses on the provenance graph of images, whose vertices 
individually represent the image variants and whose edges represent the direct pair- 
wise image relationships. Depending on the transformations applied to one image 
to obtain another, the two connected images can share partial to full visual content. 
In the case of partial content sharing, the source images of the shared content are 
called the donor images (or simply donors), while the resultant manipulated image is 
called the composite image. In full-content sharing, we have near-duplicate variants 
when one image is created from another through a series of transformations such as 
cropping, blurring, and color changes. Once a set of related images is collected from 
the first stage of content retrieval (see Sect. 15.2), a fine-grained analysis of pairwise 
relationships is required to obtain the full provenance graph. This analysis involves 
two major steps, namely (1) image similarity computation and (2) graph building. 
Similarity computation involves understanding the degree of similarity between 
two images. It is a fundamental task for any visual recognition problem. Image 
matching methods are at the core of vision-based applications, ranging from hand- 
crafted approaches to modern deep-learning-based solutions. A matching method is 
a similarity (or dissimilarity) score that can be used for further decision-making and 
classification. For provenance analysis, computing pairwise image similarity helps 
distinguish between direct versus indirect relationships. A selection of a feasible set 
of pairwise relationships creates a provenance graph. To analyze the closest prove- 
nance match to an image in the provenance graph, pairwise matching is performed 
for all possible image pairs in the set of k retrieved images. The similarity scores are 
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then recorded in a matrix M of size k x k where each cell indexed M(i, j) represents 
the similarity between image I; and image /;. 

Graph building, in turn, comprises the task of constructing the provenance graph 
after similarity computation. The matrix containing the similarity scores for all pairs 
of images involved in the provenance analysis for each case can be interpreted as an 
adjacency matrix. This implies that each similarity score in this matrix is the weight 
of an edge in a complete graph of k vertices. Extracting a provenance graph requires 
selecting a minimal set of edges that span the entire set of relevant images or vertices 
(this can be different from k). If the similarity measure used for the previous stage is 
symmetric, the final graph will be undirected, whereas an asymmetric measure of the 
similarity will lead to a directed provenance graph. The provenance cases considered 
in the literature, so far, are spanning trees. This implies that there are no cycles within 
graphs, and there is at most one path to get from one vertex to another. 


15.3.1 Approaches 


There are multiple aspects of a provenance graph. Vertices represent the different 
variants of an image or visual subject, pairwise relationships between images (i.e., 
undirected edges) represent atomic manipulations that led to the evolution of the 
manipulated image, and directions for these relationships provide more precise infor- 
mation about the change. Finally, the last details are the specific operations performed 
on one image to create the other. The fundamental task for an image-based prove- 
nance analysis is, thus, performing image comparison. This stage requires describing 
an image using a global or a set of local descriptors. Depending on the methods used 
for image description and matching, the similarity computation stage can create dif- 
ferent types of adjacency weight matrices. The edge selection algorithm then depends 
on the nature of the computed image similarity. In the rest of this section, we present 
a series of six graph construction techniques that have been proposed in the literature 
and represent the current state of the art in image provenance analysis. 


Undirected Graphs: A simple and yet-effective graph construction solution was 
proposed by Bharati et al. (2017). It takes the top k retrieved images for the given 
query and computes the similarity between the two elements of every image pair, 
including the query, through keypoint extraction, description, and matching. Key- 
point extractors and descriptors, such as SIFT (see Lowe 2004) or SURF (see Bay et 
al. 2008), offer a manner to highlight the important regions within the images (such 
as corners and edges), and to describe their content in a way that is robust to several 
of the transformations manipulated images might have been through (such as scal- 
ing, rotating, and blurring.). The quantity of keypoint matches that are geometrically 
consistent with the others in the match set can act as an image similarity score for 
each image pair. 

As depicted in Fig. 15.15, two images that share visual content will present more 
keypoint matches (see Fig. 15.15a) than the ones that have nothing in common (see 
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Fig. 15.15 Examples of geometrically consistent keypoint matching. In a, the matching of key- 
points over two images that share visual content. In b, the absence of matches between images that 
do not look alike. The number of matching keypoint pairs can be used to express the similarity 
between two images 


Fig. 15.15b). Consequently, a symmetric pairwise image adjacency matrix can be 
built by simply using the number of keypoint matches between every image pair. 
Ultimately, a maximum spanning tree algorithm, such as Kruskal’s (1956) or Prim’s 
(1957), can be used to generate the final undirected image provenance graph. 


Directed Graphs: The previously described method has the limitation of generating 
only symmetric adjacency matrices, therefore, not providing enough information to 
compute the direction of the provenance graphs’ edges. As explained in Sect. 15.1, 
within the problem of provenance analysis, the direction of an edge within a prove- 
nance graph expresses the important information of which asset gives rise to the 
other. 

Aiming to mitigate this limitation and inspired by the early work of Dias et 
al. (2012), Moreira et al. (2018) proposed an extension to the keypoint-based image 
similarity computation alternative. After finding the geometrically consistent key- 
point matches for each pair of images (J;, /;), the obtained keypoints can be used for 
estimating the homography H;; that guides the registration of image J; onto image 
Ij, as well as the homography H;; that analogously guides the registration of image 
I; onto image J. 

In the particular case of Hj;, after obtaining the transformation T;(J;) of image 
I; towards 1;, T;(J;) and J; are properly registered, with T;(J;) presenting the same 
size of J; and the matched keypoints relying on the same position. One can, thus, 
compute the bounding boxes that enclose all the matched keypoints within each 
image, obtaining two correspondent patches Ri, within T;(1;), and R2, within J;. 
With the two aligned patches at hand, the distribution of the pixel values of R; can be 
matched to the distribution of R>, before calculating the similarity (or dissimilarity) 
between them. 

Considering that patches R, and R> have the same width W and height H after 
content registration, one possible method of patch dissimilarity computation is the 
pixel-wise mean squared error (M S E): 


yU ei (Ri(w, h) — Ro(w, h))2 
Hxw 


MSE(R,, Ro) = : (15.1) 
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where Rı(w, h) € [0, 255] and R2(w, h) € [0, 255] are the pixel values of R í and 
Ro at position (w, h), respectively. 

Alternatively to MSE, one can express the similarity between R1 and R2 as the 
mutual information (M I) between them. From the perspective of information theory, 
MI is the amount of information that one random variable contains about another. 
From the point of view of probability theory, it measures the statistical dependence 
of two random variables. In practical terms, assuming each random variable as, 
respectively, the aligned and color-corrected patches R; and Ro, the value of M I can 
be given by the entropy of discrete random variables: 


p(x, y) 
MI(R,, Ro) = .)1 . (052 
(Ri, Ro) > > p(x, y) log (= p(x, y) Dy pG, 5) (15.2) 


where x € [0, 255] refers to the pixel values of R;, and y € [0, 255] refers to the pixel 
values of R2. The p(x, y) value regards the joint probability distribution function of 
Ri and R2. As explained by Costa et al. (2017), it can be satisfactorily approximated 
by 

h(x, y) 


Yin AG)’ (15.3) 


PX, y= 


where h(x, y) is the joint histogram that counts the number of occurrences for each 
possible value of the pair (x, y), evaluated on the corresponding pixels for both 
patches Ri and R2. 

As aconsequence of their respective natures, while M S E is inversely proportional 
to the two patches’ similarity, MJ is directly proportional. Aware of this, one can 
either use (i) the inverse of the MSE scores or (ii) the MI scores directly as the 
similarity elements s;; within the pairwise image adjacency matrix M, to represent 
the similarity between image 7; and the transformed version of image J; towards 7j, 
namely T; (Jj). 

The homography H;; is calculated in an analogous way to Hj; with the difference 
that 7;(/;) is manipulated by transforming 1; towards J;. Due to this, the size of 
the registered images, the format of the matched patches, and the matched color 
distributions will be different, leading to unique MSE (or M I) values for setting s;;. 
Since s;; Æ Sji, the resulting similarity matrix M will be asymmetric. Figure 15.16 
depicts this process. 

Upon computing the full matrix, the assumption introduced by Dias et al. (2012) 
is that, in the case of s;; > sji, it would be easier to transform image J; towards 
image /;, than the contrary (i.e., Z; towards J;). Analogously, s;; < sj; would mean 
the opposite. This information can, thus, be used for edge selection. The oriented 
Kruskal (2012) solution (with a preference for higher adjacency weights) would help 
construct the final provenance graph. 
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Fig. 15.16 Generation of an asymmetric pairwise image adjacency matrix. Based on the comparison 
of two distinct content transformations for each image pair (/;, Ij), namely J; towards I; and vice- 
versa, this method allows the generation of an asymmetric pairwise image similarity matrix, which 
is useful for computing provenance graphs with directed edges 


Clustered Graph Construction: As an alternative to the oriented Kruskal, Moreira 
et al. (2018) introduced a method of directed provenance graph building, which 
leverages both symmetric keypoint-based and asymmetric mutual-information-based 
image similarity matrices. 

Inspired by Oikawa et al. (2015) and dubbed clustered graph construction, the 
idea behind such a solution is to group the available retrieved images in a way that 
only near-duplicates of a common image are added to the same cluster. Starting from 
the image query J, as the initial expansion point, the remaining images are sorted 
according to the number of geometrically consistent matches shared with 74, from 
the largest to the smallest. The solution then clusters probable near-duplicates around 
I4, as long as they share enough content, which is decided based upon the number 
of keypoint matches (see Daniel Moreira et al. 2018). Once the query’s cluster is 
finished (i.e., the remaining images do not share enough keypoint matches with the 
query), a new cluster is computed over the remaining unclustered images, taking 
another image of the query’s cluster as the new expansion point. This process is 
repeated iteratively by trying different images as the expansion point until every 
image belongs to a near-duplicate cluster. 

Once all images are clustered, it is time to establish the graph edges. Images 
belonging to the same cluster are sequentially connected into a single path without 
branches. This makes sense in scenarios containing sequential image edits where 
one near-duplicate is obtained on top of the other. As a consequence of the iterative 
execution and selection of different images as expansion points, the successful ones 
(i.e., the images that were helpful in the generation of new image clusters) fatally 
belong to more than one cluster, hence serving as graph bifurcation points. Orthogo- 
nal edges are established in such cases, allowing every near-duplicate image branch 
to be connected to the final provenance graph through an expansion point image as a 
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Fig. 15.17 Usage of metadata information to refine the direction of pairwise image provenance 
relationships. For the presented images, the executed manipulation operation could be either the 
splicing or the removal of the male lion. According to the image generation date metadata, the 
operation is revealed to be a splice since the image on the left is older. Adapted from Aparna 
Bharati et al. (2019) 


Joint. To determine the direction of every single edge, Moreira et al. (2018) suggested 
using the mutual information similarity asymmetry in the same way as depicted in 
Fig.15.16. 


Leveraging Metadata: Image comparison techniques may be limited depending 
upon the transformations involved in any given image’s provenance analysis. In cases 
where the transformations are reversible or collapsible, the visual content analysis 
may not suffice for edge selection during graph building. Specifically, the homogra- 
phy estimation and color mapping steps involved in asymmetric matrix computation 
for edge direction inference could be noisy. To make this process more robust, it 
is pertinent to utilize other evidence sources to determine connections. As can be 
seen from the example in Fig. 15.17, it is difficult to point out the plausible direction 
of manipulation with visual correspondence, but auxiliary information related to the 
image, mostly accessible within the image files (a.k.a. image metadata), can increase 
confidence in predicting the directions. 

Image metadata, when available, can provide additional evidence for directed 
edge inference. Bharati et al. (2019) identify highly relevant tags for the task. Spe- 
cific tags that provide the time of image acquisition and editing, location, editing 
operation, etc. can be used for metadata analysis that corroborates visual evidence 
for provenance analysis. An asymmetric heuristic-based metadata comparison par- 
allel to a symmetric visual comparison is proposed. The metadata comparison sim- 
ilarity scores are higher for image pairs (ordered) with consistency from more sets 
of metadata tags. The resulting visual adjacency matrix is used for edge selection, 
while the metadata-based comparison scores are used for edge direction inference. 
As explained in clustered graph construction, there are three parts of the graph build- 
ing method, namely node cluster expansion, edge selection, and assigning directions 
to edges. The metadata information can supplement the last two depending on the 
specific stage at which it is incorporated. As metadata tags can be volatile in the 
world of intelligent forgeries, a conservative approach is to use them to improve the 
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confidence of the provenance graph obtained through visual analysis. The proposed 
design enables the usage of metadata when available and consistent. 


Transformation-Aware Embeddings: While metadata analysis can improve the 
fidelity of edge directions in provenance graphs when available and not tampered 
with, local keypoint matching for visual correspondence faces challenges in image 
ordering for provenance analysis. Local matching is efficient and robust to finding 
shared regions between related images. This works well for connecting donors with 
composite images but can be insufficient in capturing subtle differences between 
near-duplicate images, which affect the ordering of long chains of operations. Estab- 
lishing sequences of images that vary slightly based on the transformations requires 
differentiating between slightly modified versions of the same content. 

Towards improving the reconstructions of globally-related image chains in prove- 
nance graphs, Bharati et al. (2021) proposed encoding awareness of the transforma- 
tion sequence in the image comparison stage. Specifically, the devised method learns 
transformation-aware embeddings to better order related images in an edit sequence 
or provenance chain. The framework uses a patch-based siamese structure trained 
with an Edit Sequence Loss (ESL) using sets of four image patches. Each set is 
expressed as quadruplets or edit sequences, namely (i) the anchor patch, which rep- 
resents the original content, (ii) the positive patch, a near-duplicate of the anchor 
after M image processing transformations, (iii) the weak positive patch, the positive 
patch after N transformations, and (iv) the negative patch, a patch that is unrelated 
to the others. The quadruplets of patches are obtained for training using a specific set 
of image transformations that are of interest to image forensics, particularly image 
phylogeny and provenance analysis, as suggested in Dias et al. (2012). For each 
anchor patch, random unit transformations are sequentially applied, one on top of 
the other’s result, allowing to generate positive and weak positive patches from the 
anchor, after M and M + N transformations, respectively. The framework aims at 
providing distance scores to pairs of patches, where the output score between the 
anchor and the positive patch is smaller than the one between the anchor and the 
weak positive, which, in turn, is smaller than the score between the anchor and the 
negative patch (as shown in Fig. 15.18). 

Given a feature vector for an anchor image patch a, two transformed derivatives 
of the anchor patch p (positive) and p’ (weak positive) where p = Ty (a) and p’ = 
Ty (Tu (a)), and an unrelated image patch from a different imagen, E SL is a pairwise 
margin ranking loss computed as follows: 


SL(a, p, p',n) =max(0, —y x (d(a, p') — d(a, n)) + pı)+ 
max(0, —y x (d(p, p') — d(p, n)) + m2)+ (15.4) 
max(0, —y x (d(a, p) — d(a, p')) + m3) 


Here, y is the truth function which determines the rank order (see Rudin and 
Schapire 2009) and u1, o, and u3 are margins corresponding to each pairwise 
distance term and are treated as hyperparameters. Both terms having the same sign 
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Fig. 15.18 Framework to learn transformation-aware embeddings in the context of ordering image 
sequences for provenance. Specifically, to satisfy an edit-based similarity precision constraint, i.e., 
d(a, p) < d(a, p’) < d(a, n). Adapted from Aparna Bharati et al. (2021) 


implies ordering is correct, and the loss is zero. A positive loss is accumulated when 
the ordering is wrong, and they are of opposite signs. 

The above loss is optimized, and the model corresponding to the best measure for 
validation is used for feature extraction from patches of test images. Features learned 
with the proposed technique are used to provide pairwise image similarity scores. 
The value d;; between images J; and J; is computed by matching the set of features 
(extracted from patches) from one image to the other using an iterative greedy brute- 
force matching strategy. At each iteration, the best match is selected as the pair 
of patches between image J; and image J; whose /2-distance is the smallest and 
whose patches did not participate in a match on previous iterations. This guarantees 
a deterministic behavior regardless of the order of the images, meaning that either 
comparing the patches of J; against J; or vice-versa will lead to the same consistent set 
of patch pairs. Once all patch pairs are selected, the average /2-distance is calculated 
and finally set as dj;. The inverse of dj; is then used to set both s;; and s;; within the 
pairwise image similarity matrix M, which in this case is a symmetric one. Upon 
computing all values within M for all possible image pairs, a greedy algorithm (such 
as Kruskal’s 1956) is employed to order these pairwise values and create an optimally 
connected undirected graph of images. 


Leveraging Manipulation Detectors: A challenging aspect of image provenance 
analysis is establishing high-confidence direct relationships between images that 
share a small portion of content. Keypoint-based approaches may not suffice as there 
may not be enough keypoints in the shared regions, and global matching approaches 
may not appropriately capture the matching region’s importance. To improve analysis 
of composite images where source images have only contributed a small region and 
determine the source image among a group of image variants, Zhang et al. (2020) pro- 
posed to combine a pairwise ancestor-offspring classifier with manipulation detection 
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approaches. They build the graph by combining edges based on both local feature 
matching and pixel similarity. 

Their proposed algorithm attempts to balance global and local features and match- 
ing scores to boost performance. They start by using a weighted combination of 
the matched SIFT keypoints and the matched pixel values for image pairs that can 
be aligned, and null for the ones that cannot be aligned. A hierarchical cluster- 
ing approach is used to group images coming from the same source together. For 
graph building within each determined cluster, the authors combine the likelihood 
of images being manipulated or extracted from a holistic image manipulation detec- 
tor (see Zhang et al. 2020) and the pairwise ancestor score extracted by an L2-Net 
(see Tian et al. 2017). The image manipulation detector uses a patch-based con- 
volutional neural network (CNN) to predict manipulations from a median-filtered 
residual image. For ambiguous cases where the integrity score may not be assigned 
accurately, a lightweight CNN-based ancestor-offspring network takes patch pairs 
as input and predicts one’s scores to be derived from the other. The similarity scores 
used as edge weights are the average of the integrity and the ancestor scores from 
the two used networks. The image with the highest score among the smaller set of 
images is considered as the source. All incoming links to this vertex are removed to 
reduce confusion in directions. This one is then treated as the root of the arborescence 
built by applying Chu-Liu/Edmonds’ algorithm (see Chu 1965; Edmonds 1967) on 
pairwise image similarities. 

The different arborescences are connected by finding the best-matched image 
pair among the image clusters. If the matched keypoints are above a threshold, 
these images are connected, indicating a splicing or composition possibility. As 
reported in the following section, this method obtains state-of-the-art results on the 
NIST challenges (MFC18 2018 and MFC19 2019), and it significantly improves 
the computation of the edges of the provenance graphs over the Reddit Photoshop 
Battles dataset (see Brogan 2021). 


15.3.2 Datasets and Evaluation 


With respect to the step of provenance graph construction, four datasets stand out 
as publicly available and helpful benchmarks, namely NC17 (2017), MFC18 (2018) 
(both discussed in Sect. 15.2.2), MFC19 (2019), and the Reddit Photoshop Battles 
dataset (2021). 

NC17 (2017): As mentioned in Sect. 15.2.2, this dataset contains an interesting 
development partition (Dev1-Beta4), which presents 65 image queries, each one 
belonging to a particular manually curated provenance graph. As expected, these 
provenance graphs are provided within the partition as ground truth. The number of 
images per provenance graph ranges from two to 81, with the average graph order 
being equal to 13.6 images. 

MFC18 (2018): Besides providing images and ground truth for content retrieval 
(as explained in Sect. 15.2.2), the Eval-Ver1-Part1 partition of this dataset also pro- 
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Fig. 15.19 A representation of how provenance graphs were obtained from the users’ interactions 
within the Reddit r/photoshopbattles subreddit. In a, the parent-child posting structure of comments, 
which are used to generate the provenance graph depicted in b. Equivalent edges across the two 
images have the same color (either green, purple, blue, or red). Adapted from Daniel Moreira et al. 
(2018) 


vides provenance graphs and 897 queries aiming at evaluating graph construction. 
For this dataset, the average graph order is 14.3 images, and the resolution of its 
images is larger, on an average, when compared to NC17. Moreover, its provenance 
cases encompass a larger set of applied image manipulations. 

MFC19 (2019): A more recent edition of the NIST challenge released a larger set 
of provenance graphs. The Eval-Part1 partition within the MFC19 (2019) dataset has 
1,027 image queries, and the average order of the provided ground truth provenance 
graphs is equal to 12.7 image vertices. In this group, the number of types of image 
manipulations used to generate the edges of the graphs was almost twice the number 
of MFC18. 

Reddit Photoshop Battles (2021): Aiming at testing the image provenance anal- 
ysis solutions over more realistic scenarios, Moreira et al. (2018) introduced the 
Reddit Photoshop Battles dataset. This dataset was collected from images posted to 
the Reddit community known as r/photoshopbattles (2012), where professional and 
amateur image manipulators share doctored images. Each “battle” starts with a teaser 
image posted by a user. Subsequent users post modifications of either the teaser or 
previously submitted manipulations in comments to the related posts. By using the 
underlying tree comment structure, Moreira et al. (2018) were able to infer and col- 
lect 184 provenance graphs, which together contain 10,421 original and composite 
images. Figure 15.19 illustrates this provenance graph inference process. 
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To evaluate the available graph construction solutions, two configurations are 
proposed by the NIST challenge (2017). In the first one, named the “oracle” scenario, 
there is a strong focus on the graph construction task. It assumes that a flawless 
content retrieval solution is available, thus, starting from the ground-truth content 
retrieval image ranks to build the provenance graphs, with neither missing images nor 
distractors. In the second one, named “end-to-end” scenario, content retrieval must be 
performed before graph construction, thus, delivering imperfect image ranks (with 
missing images or distractors) to the step of graph construction. We rely on both 
configurations and on the aforementioned datasets to report results of provenance 
graph construction, in the following section. 

Metrics: As suggested by NIST (2017), given a provenance graph G(V, E) gener- 
ated by a solution whose performance we want to assess, we compute the F 1-measure 
(i.e., the harmonic mean of precision and recall) of the (i) retrieved image vertices 
V and of the (ii) established edges E, when compared to the ground truth graph 
G'(V', E'), with its V’ and E’ homologous components. The first metric is named 
vertex overlap (V O) and the second one is named edge overlap (E O), respectively: 


IViNV| 

VO(G’,G)= 2x ——— (15.5) 
IVI + |V| 

EO(G',G)= 2 eye (15.6) 

š = eo . 
|E’| + |E| 


Moreover, we compute the vertex and edge overlap (V E O), which is the F 1-measure 
of retrieving both vertices and edges simultaneously: 


IVAVI+IE'N E| 


VEO(G',G)= 2x — 
IV") + IVI + |E + |Ë] 


(15.7) 


In a nutshell, these metrics aim at assessing the overlap between G and G'. The 
higher the values of VO, EO, and V E O, the better the performance of the solution. 
Finally, in the particular case of EO and VEO, when they are both assessed for 
an approach that does not generate directed graphs (such as Undirected Graphs and 
Transformation-Aware Embeddings, presented in Sect. 15.3.1), an edge within F is 
considered a hit (i.e., a correct edge) when there is a homologous edge within E’ that 
connects equivalent image vertices, regardless of the edges’ directions. 


15.3.3 Results 


Table 15.4 puts in perspective the different provenance graph construction approaches 
explained in Sect. 15.3.1, when executed over the NC17 dataset. The provided results 
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were all collected in oracle mode, hence the high values of V O (above 0.9), since 
there are neither distractors nor missing images in the rank lists used to build the 
provenance graphs. A comparison between rows | and 2 within this table shows 
the efficacy of leveraging image metadata as additional information to compute the 
edges of the provenance graphs. The values of E O (and V E O, consequently) have 
a significant increase (from 0.12 to 0.45, and from 0.55 to 0.70, respectively), when 
metadata is available. In addition, by comparing rows 3 and 4, one can observe the 
contribution of the data-driven Transformation-Aware Embeddings approach, in the 
scenario where only undirected graphs are being generated. In both cases, the gener- 
ated edges have no direction by design, making their edge overlap conditions easier 
to be achieved (since the order of the vertices within the edges become irrelevant for 
the computation of EO and V EO, justifying their higher values when compared 
to rows 1 and 2). Nevertheless, contrary to the first two approaches, these solutions 
are not able to define which image gives rise to the other within the established 
provenance edges. 

Table 15.5 compares the current state-of-the-art solution (Leveraging Manipula- 
tion Detectors by Xu Zhang et al. 2020) with the official NIST challenge participation 
results of the Purdue-Notre Dame team (2018), for both MFC18 and MFC 19 datasets. 
In both cases, the reported results refer to the more realistic end-to-end scenario, 
where performers must execute content retrieval prior to building the provenance 
graphs. As a consequence, the image ranks fed to the graph construction step are 
noisy, since they contain both missing images and distractors. For all the reported 
cases, the image ranks had 50 images and presented an average R@50 of around 
90% (i.e., nearly 10% of the needed images are missing). Moreover, nearly 35% of 
the images within the 50 available ones in a rank are distractors, on average. The 


Table 15.4 Results of provenance graph construction over the NC17 dataset. Reported here are 
the average vertex overlap (V O), edge overlap (EO), and vertex and edge overlap (V EO) values 
of 65 queries. These experiments were executed in the “oracle” scenario, where the image ranks 
fed to the graph construction step are perfect (i.e., with neither distractors nor missing images) 


Graph VO EO VEO Source 
construction 

approach 

Clustered graph | 0.93 0.12 0.55 Daniel Moreira 
construction et al. (2018) 
Clust. Graph 0.93 0.45 0.70 Aparna Bharati 
Const., et al. (2019) 
Leveraging 

Metadata 

Undirected 0.90 10.65 10.78 Aparna Bharati 
Graphs et al. (2021) 
Transformation- | 1.00 10.68 10.85 Aparna Bharati 
Aware et al. (2021) 
Embeddings 


TValues collected over undirected edges 
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Table 15.5 Results of provenance graph construction over the MFC18 and MFC19 datasets. 
Reported here are the average vertex overlap (V O), edge overlap (E O), and vertex and edge 
overlap (V E O) values of 897 queries, in the case of MFC18, and of 1,027 queries, in the case of 
MFC19. These experiments were executed in the “end-to-end” scenario, thus, building graphs upon 


imperfect image ranks (i.e., with distractors or missing images) 


Dataset Graph VO EO VEO Source 
Const. approach 

MFC18 Clustered Graph | 0.80 0.27 0.54 MFC19 
Construction (2019) 
Leveraging 0.82 0.40 0.61 Xu Zhang 
Manipulation et al. (2020) 
Detectors 

MFC19 Clustered Graph | 0.70 0.30 0.52 MFC19 
Construction (2019) 
Leveraging 0.85 0.42 0.65 Xu Zhang 
Manipulation et al. (2020) 
Detectors 


Table 15.6 Results of provenance graph construction over the Reddit Photoshop Battles dataset. 
Reported here are the average vertex overlap (V O), edge overlap (E O), and vertex and edge overlap 
(V EO) values of 184 queries. These experiments were executed in the “oracle” scenario, where the 
image ranks fed to the graph construction step are perfect. “N.R.” stands for not-reported values 


Graph VO EO VEO Source 
construction 

approach 

Clustered Graph | 0.76 0.04 0.40 Daniel Moreira 
Construction et al. (2018) 
Clust. Graph 0.76 0.09 0.42 Aparna Bharati 
Const., et al. (2019) 
Leveraging 

Metadata 

Leveraging N.R. 0.21 N.R. Xu Zhang et al. 
Manipulation (2020) 
Detectors 


best solution (contained in rows 2 and 4 within Table 15.5) still delivers low values 
of EO when compared to V O, revealing an important limitation of the available 
approaches. 

Table 15.6, in turn, reports results on the Reddit Photoshop Battles dataset. As one 
might observe, especially in terms of E O, this set is a more challenging one for the 
graph construction approaches, except for the state-of-the-art solution (Leveraging 
Manipulation Detectors by Xu Zhang et al. (2020). While methods that have worked 
fairly well on the NC17, MFC18, and MFC19 datasets drastically fail in the case 
of the Reddit dataset (see E O values below 0.10 in the case of rows 1 and 2), the 
state-of-the-art approach (in the last row of the table) more than doubles the results 
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of EO. Again, even with this improvement, increasing the values of EO within 
graph construction solutions is still an open problem that deserves attention from 
researchers. 


15.4 Content Clustering 


In the study of human communication on the Internet and the understanding of the 
provenance of trending assets, such as memes and other forms of viral content, 
the users’ intent is mainly focused on the retrieval, selection, and organization of 
semantically similar objects, rather than the gathering of near-duplicate variants or 
compositions that are related to a query. Under this scenario, although the step of 
content retrieval may be useful, the step of graph construction loses its purpose, 
since the available content is preferably related through semantics (e.g., different 
people on diverse scenes doing the same action, such as the “dabbing” trend depicted 
on Fig.15.8), greatly varying in appearance, making the techniques presented on 
Sect. 15.3 less suitable. 

Humans can only process so much data at once—if too much is present, they begin 
to be overwhelmed. In order to help facilitate the human processing and perception 
of the retrieved content, Theisen et al. (2020) proposed a new stage to the provenance 
pipeline focused on image clustering. Clustering the images based on shared elements 
helps triage the massive amounts of data that the pipeline has grown to accommodate. 
Nobody can reasonably review several million images to find emerging trends and 
similarities, especially without ordering in the image collection. However, when these 
images are grouped based on shared elements using the provenance pipeline, the 
number of items a reviewer would have to look at can decrease by several magnitude 
orders. 


15.4.1 Approach 


Object-Level Content Indexing: From the content retrieval techniques discussed 
in Sect. 15.2, Theisen et al. (2020) recommended the use of OS2OS matching (see 
Brogan et al. 2021) to index the millions of images eventually available, due to two 
major reasons. Firstly, to obtain a fast and scalable content retrieval engine. Secondly, 
to benefit from the OS2OS capability of comparing images through either large and 
global content matching or through many small object-wise local matches. 


Affinity Matrix Creation: In the task of analyzing a large corpus of assets shared 
on the web to understand the provenance of a trend, the definition of a query (i.e., 
a questioned asset) is not always as straightforward as it is in the image provenance 
analysis case. For example, the memes shared during the 2019 Indonesian elections 
and discussed in William Theisen et al. (2020). In such cases, a natural question to ask 
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would be which memes in the dataset one should use as the queries, for performing 
the first step of content retrieval. 

Inspired by a simplification of iterative filtering (see Sect. 15.2.1), Theisen et 
al. (2020) identified the cheapest option as being randomly sampling images from 
the dataset and iteratively using them as queries, for executing content retrieval until 
all the dataset images (or a sufficient number of them) are “touched” (i.e., they are 
retrieved by the content retrieval engine). There are several advantages to this, other 
than being easy to implement. Randomly sampling means that end-users would need 
to have no prior knowledge of the dataset and potential trends they are looking for. 
The cluster created at the end of the process would show “emergent” trends, which 
could even surprise the reviewer. 

On the other hand, from a more forensics-driven perspective, it is straightforward 
to imagine a system in which “informed queries” are used. If the users already 
suspect that several specific trends may exist in the data, cropped images of the 
objects pertaining to a trend may be used as a query, thus, prioritizing the content 
that the reviewers are looking for. This might be a demanding process because the 
user must already have some sort of idea of the landscape of the data and must 
produce the query images themselves. 

Following the suit in the solution proposed in William Theisen et al. (2020), 
prior to performing clustering, a particular type of image pairwise adjacency matrix 
(dubbed affinity matrix) must first be generated. By leveraging the provenance 
pipeline’s pre-existing steps, this matrix can be constructed based on the retrieval of 
the many selected queries. To prepare for the clustering step, a number of queries need 
to be run through the content retrieval system, the number of queries, and the recall 
of them depending on what type of graph the user wants to model for the analyzed 
dataset. Using the retrieval step’s output, a number of query-wise “hub” nodes are 
naturally generated, each of them having many connections. Consider, for example, 
that the user has decided to have a recall of the top 100 best matches for any selected 
query. This means that for every query submitted, there are 100 connections to other 
assets. These connections link the query image to each of the 100 matches, thus, 
imposing a “hubness” onto the query image. For the sake of illustration, Fig. 15.20 
shows an example of this process, for the case of memes shared during the 2019 
Indonesian elections. 

By varying the recall and number of queries, the affinity matrix can be generated, 
and an order can be imposed. Lowering the recall will result in a decrease in hub- 
like nodes in the structure but will require more queries to be run to connect all the 
available images. The converse is also true. Once enough queries have been run, the 
final step of clustering can be executed. 


Spectral Clustering: After the affinity matrix has been generated, Theisen et 
al. (2020) suggested the use of multiclass spectral clustering (see Stella and Shi 
2003) to organize the available images. In the case of memes, this step has the effect 
of assigning each image to a hypothesized meme genre, similar to the definition pre- 
sented by Shifman (2014). The most important and perhaps trickiest part is deciding 
on a number of clusters to create. While more clusters may allow for a more targeted 
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Fig. 15.20 A demonstration of the type of structure that may be inferred from the output of content 
retrieval, with many selected image queries. According to global and object-level local similarity, 
images with similar sub-components should be clustered together, presenting connections to their 
respective queries. Image adapted from William Theisen et al. (2020) 


look into the dataset, it increases processing time for both the computer generating 
the clusters and the human reviewing them at the end. In their studies, Theisen et 
al. (2020) pointed out that 150 clusters appeared sufficient when the number of total 
dataset available images was on the order of millions of samples. This allowed for a 
maximum amount of images per cluster that a human could briefly review, but not 
so few images that trends could not be noticed. 


15.4.2 Datasets and Evaluation 


Indonesian Elections Dataset: The alternative content clustering stage inside the 
framework of provenance analysis aims to unveil emerging trends in large image 
datasets collected around a topic or an event of interest. To evaluate the capability 
of the proposed solutions, Theisen et al. (2020) collected over two million images 
from social media, concentrated around the 2019 Indonesian presidential elections. 
Harvested from Twitter (2021) and Instagram (2021) over 13 months (from March 
31, 2018, to April 30, 2019), this dataset spans an extensive range of emotion and 
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Fig. 15.21 Imposter-host test to measure the quality of the computed image clusters. Five images 
are presented at a time to a human subject, four of which belong to the same host cluster. The 
remaining one belongs to an unrelated cluster, therefore, being named imposter. The human subject 
is always asked to identify the imposter, which in this example is the last image. A high number 
of correct answers given by humans indicate that the clustering solution was successful. Image 
adapted from William Theisen et al. (2020) 


thought surrounding the elections. The images are publicly available at https://bit. 
ly/2RjOodI. 


Human Understanding Assessments: Keeping the objective of aiding the human 
understanding of large image datasets in mind, a metric is needed to measure how 
meaningful an image cluster is for a human. Inspired by the work of Weninger et 
al. (2012), Theisen et al. (2020) proposed an imposter-host test, which is performed 
by showing a person a collection of N images, all but one of which are from a 
common “host” cluster, which was previously computed by the proposed solution, 
whose quality needs to be measured. The other item, the “imposter”, is randomly 
selected from one of the other established clusters. The idea is that the more related 
the images in a single cluster are, the easier it should be for a human to pick out 
the imposter image. Figure 15.21 depicts an example of this configuration. To test 
the results of their studies, Theisen et al. (2020) hired Amazon Mechanical Turk 
workers (2021) to perform 25 of these imposter-host tasks each subject, with N 
equal to 5 images, in order to measure the ease with which a human can pick out an 
imposter from the clusters that were generated. 
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Fig.15.22 Accuracy of the imposter-host tasks given to Amazon Mechanical Turk workers, relative 
to cluster size. The larger the point on the accuracy axis, the more images were in the cluster. The 
largest cluster is indicated as having 15.89% of all the dataset images. Image adapted from William 
Theisen et al. (2020) 


15.4.3 Results 


The Amazon Mechanical Turk experiments reported by Theisen et al. (2020) demon- 
strated that the provenance pipeline’s clustering step can produce human interpretable 
clusters while minimizing the average cluster size. The average accuracy for the 
imposter-host test was 62.42%. If the worker is shown five images, the chance of 
correctly guessing would be 1/5 (20%). Therefore, the reported average is far above 
the baseline, thus, demonstrating the salience of the trends discovered in the clus- 
ters to human reviewers. The median cluster size was only 132 images per cluster. 
A spread of the cluster sizes as related to the worker accuracy for a cluster can be 
seen in Fig. 15.22. Surprisingly even the largest cluster, containing 15.89% of all the 
images, has an accuracy still higher than random chance. Three examples of what 
an individual cluster could look like can be seen in Fig. 15.23. 


15.5 Open Issues and Research Directions 


State of the Art: Based on the results presented and discussed in the previous 
sections within this chapter, one can safely conclude that, in the current state of the 
art of provenance analysis, the available solutions are indeed proven to be effective 
for at least two tasks, namely (1) authenticity verification of images and (2) the 
understanding of image-sharing trends online. 

As for the verification of the authenticity of a questioned image (i.e. the query), 
this is done through the principled inspection of a given corpus of potentially related 
images to the query. In this case, whenever a non-empty set of related images is 
retrieved and a non-empty provenance graph (either directed or undirected) is gen- 
erated and presented, one might infer the edit history of the query and take it as 
evidence of potential conflicts that might attest against its authenticity, such as inad- 
vertent manipulations or source misattributions. 

Regarding the understanding of trending content online, this can be done during 
the unfolding of a target event, such as national elections or international sports 
competitions. In this case, the quick retrieval of related images and subsequent 
content-based clustering have the potential to unveil emergent trends and surface 
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EKONOMI INDONESIA: 


(c) 


Fig. 15.23 Three examples of what the obtained clusters may look like. In a and b, it is very easy 
to see the similarity shared among all the images. In c, it is not quite as clear, but if one were to look 
a little more closely, they might notice the stylized “01” logo in each of the images. This smaller, 


shared component is what makes object-level matching and local feature analysis a compelling tool. 
Image adapted from William Theisen et al. (2020) 
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their provenance, allowing for a better comprehension of the event itself, as well 
as its relevance. A summary of what the public is thinking about an event and who 
the actors are trying to steer public opinion is only a couple of the possibilities that 
provenance content clustering may allow one to do. 

While results have been very promising, no solution is without flaws, and much 
work is left to be done. In particular, image provenance analysis still has many caveats, 
which are briefly described below. Similarly, multimodal provenance analysis is an 
unexplored field, which deserves a great deal of attention from researchers shortly. 


Image Provenance Analysis Caveats: There are open problems in image prove- 
nance analysis that still need attention from the scientific community. For instance, 
the solutions proposed to compute the directions of the edges of the provenance 
graphs still deliver values of edge overlap that are inferior to the results for vertex 
recall and vertex overlap. This aspect indicates that the proposed solutions are better 
at determining which images must be added to the provenance graphs as vertices, but 
there is still plenty of room to improve the computation of how these vertices must be 
connected. In this regard, novel techniques of learning which image might have been 
acquired or generated first and leveraging the output of single-image manipulation 
detectors (tackled throughout this book) are desired. 

Moreover, an unexplored aspect of image provenance analysis is understanding, 
representing, and detecting the space of image transformations used to generate 
one image from the other. By doing so, one will determine what transformations 
might have been performed during the establishment of an edge, a problem that 
currently still lacks solutions. Thankfully, through the recent joint efforts of DARPA 
and NIST within the MediFor program (2021), a viable regime for data generation 
and annotation at the level of registering the precisely applied image transformations 
that have been performed from one asset to the other has emerged. This regime’s 
outcome is available to the scientific community as a useful benchmark (see Guan 
et al. 2019) for further research. 


Multimodal Provenance Analysis: As explained in Sect. 15.3, we have already wit- 
nessed that additional information such as image metadata helps to improve prove- 
nance analysis. Moving forward, another important research direction that requires 
progress is the identification and principled usage of information coming from other 
asset modalities (such as the text of image captions, in the case of questionable 
assets that are rich documents), as well as the development of provenance analysis 
for media types other than images (e.g., the provenance of text documents, videos, 
audios, etc.). 

When one considers the multimodal aspect, many questions appear and are open 
for research. One of them is how to analyze complex assets, such as the texts coming 
from different suspect documents (looking for cases of plagiarism and attribution 
to the original author), or the images extracted from scientific papers (which may 
be inadvertently reused to fabricate scientific findings, such as what was recently 
described in PubPeer Foundation (2020)). Another question is how to leverage images 
and captions in a document, or the video frames and their respective audio subtitles, 
within a movie. 
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Video provenance analysis, in particular, is a topic that deserves a great deal 
of attention. While one might assume image-based methods could be extended to 
video by being run over multiple frames, such a solution would fail to glean videos’ 
inherent temporal dimension. Sometimes videos shared on social media such as 
Instagram (2021) or TikTok (2021) are composites of one or more pieces of viral 
footage. Tracking the provenance of videos is as important as tracking the provenance 
of still images. 

Document provenance analysis, in turn, has also gained attention lately due to 
the recent pandemic of bad science (see Scheirer 2020). The COVID-19 crisis has 
caused a subsequent explosion in scientific publications surrounding the pandemic, 
not all of which might be considered highly rigorous. Using a provenance pipeline 
to aid in document analysis could allow reviewers to find repeated figures across 
many publications. It would be the reviewers’ decision, though, to determine if the 
repetition is something as simple as citing the previous work or something more 
nefarious like trying to claim pre-existing work as one’s own. 


Towards an End-user Provenance Analysis System: Finally, for the solutions 
discussed in this chapter to be truly useful, they require front-end development and 
back-end integration, which are currently missing. Such a system would allow users 
to perform tasks similar to reverse image search, with the explicit intent of finding 
the origins of all related content within the query asset in question. If a front-end 
requiring minimal user skill to use could be designed, provenance analysis could be 
consolidated as a powerful tool for fighting fake news and misinformation. 


Table 15.7 List of datasets useful for provenance analysis 


Par š QR 

Dataset Year Description Link Code 
Reddit 184 provenance graphs involving bepari /bit: EER 
Photoshop 2018 10,421 images collected from the ly/34£rL8a P. £ 3 
Battles Reddit PhotoshopBattles community. ae ne 

A Reunion of NC17 
Open Media . 
Forensics 2020 MFC18, MFC19 and more Httpe::/ soit: 


recent sets, with useful partitions for ly/3svh81R 


Challenge ; 
provenance analysis. 
Indonesian Over two million images collected Pepa ya ae mrsm 
i Í š š a PN 
Election 2021 from Twitter and Instagram about the 1y/2Rj0odI Rsa 


Memes 2019 Indonesian elections. 
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15.6 Summary 


The current state of the art of provenance analysis is mature enough for aiding 
image authenticity verification by processing a corpus of images rather than the 
processing of a single and isolated asset. However, more work needs to be done to 
improve the quality of the generated provenance əraphs concerning the direction and 
identification of image transformations associated with the graphs’ edges. Besides, 
provenance analysis still needs to be extended to media types other than images, such 
as video, audio, and text. In this regard, both the development of algorithms and the 
collection of datasets are yet to be done, revealing a unique research opportunity. 


Provenance Analysis Datasets: Table 15.7 enlists the currently available and useful 


datasets for image provenance analysis. 


Table 15.8 List of implementations of provenance analysis solutions 


š ¿q : QR 
Solution Year Description Link Code 
Context 2017 Implementation of content retrieval https://bit. 
Incorporation based on context incorporation. ly/3daU2Ju 
Undirected Implementation of graph httpas//pik. 

Graph 2017 construction that generates ly /3wRi2oM 
Construction undirected graphs. ” aoe 
Implementation of content retrieval 
Provenance based on distributed keypoints and f mpm 
Analysis at 2018 iterative filtering, and of graph httpss//bit. FN: 
PERA construction that generates directed ly/3mKlVev EBJESEP 
graphs, including the clustered 
approach. 
Implementation of provenance i 
NIST ; . I httpss//biti. 
MedisScore 2018 analysis metrics and evaluation ly/3anUMsN 
toolkit provided by NIST. 
Leveraging Implementation of content retrieval i [s] [=] 
ñ - : š : mee 
Manipulation 2020 and of graph construction supported rie x 4 oo eer 
Detectors by image manipulation detectors. k 2 eure 
Object-Level Implementation of the OS2OS Keegan) aie 
Content 2021  object-level content matching and o SOTM I 
Retrieval retrieval strategy. -Y 
Transform- Implementation of graph : | 
Aware 2021 construction based on sah h — ` ; 
Embeddings transformation-aware embeddings. =y a i 
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Provenance Analysis Source Code: Table 15.8 summarizes the currently available 
source code for Image provenance analysis. 
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Part IV 
Counter-Forensics 


Chapter 16 A) 
Adversarial Examples in Image Forensics i" 


Mauro Barni, Wenjie Li, Benedetta Tondi, and Bowen Zhang 


Abstract We describe the threats posed by adversarial examples in an image forensic 
context, highlighting the differences and similarities with respect to other applica- 
tion domains. Particular attention is paid to study the transferability of adversarial 
examples from a source to a target network and to the creation of attacks suitable 
to be applied in the physical domain. We also describe some possible countermea- 
sures against adversarial examples and discuss their effectiveness. All the concepts 
described in the chapter are exemplified with results obtained in some selected image 
forensics scenarios. 


16.1 Introduction 


Deep neural networks, most noticeably Convolutional Neural Networks (CNNSs), 
are increasingly used in image forensic applications due to their superior accuracy 
in detecting a wide number of image manipulations, including double and multiple 
JPEG compression (Wang and Zhang 2016; Barni et al. 2017), median filtering (Chen 
et al. 2015), resizing (Bayar and Stamm 2018), contrast manipulation (Barni et al. 
2018), splicing (Ahmed and Gulliver 2020). Good performance of CNNs have also 
been reported for image source attribution, i.e., to identify the model of the camera 
which acquired a certain image (Bondi et al. 2017; Freire-Obregon et al. 2018; Bayar 
and Stamm 2018), and deepfake detection (Rossler et al. 2019). Despite the good 
performance they achieve, the use of CNNs in security-oriented applications, like 
image forensics, is put at risk by the easiness with which adversarial examples can 
be built (Szegedy et al. 2014; Carlini and Wagner 2017; Papernot et al. 2016). As a 
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matter of fact, an attacker who has access to the Internal details of the CNN used for 
a certain image recognition task can easily build an attacked image which is visually 
indistinguishable from the original one but is misclassified by the CNN. Such a 
problem is currently the subject of an intense research activity, yet no satisfactory 
solution has been found yet (see Akhtar and Ajmal (2018) for a survey on this 
topic). The problem is worsened by the observation that adversarial attacks are often 
transferable from the target network to other networks designed for the same task 
(Papernot et al. 2016). This means that even in a Limited Knowledge (LK) scenario, 
wherein the attacker has only partial information about the to-be-attacked network, 
he can attack a surrogate network mimicking the target one and the attack will be 
effective also on the target network with good probability. Such a property opens the 
way toward very powerful attacks that can be used in real applications wherein the 
attacker does not have full access to the attacked system (Papernot et al. 2016). 
While some recent works have shown that CNN-based image forensics tools are 
also susceptible to adversarial examples (Guera et al. 2017; Marra et al. 2018), some 
differences exist between the generation of adversarial examples targeting computer 
vision networks and networks trained to solve image forensic problems. For instance, 
in Barni et al. (2019), it is shown that adversarial examples aiming at deceiving image 
forensic networks are less transferable than those addressing computer vision tasks. 
A possible explanation for such a different behavior could be the different kinds 
of features image forensic networks rely on with respect to those used in computer 
vision applications, however a definitive answer to this question is not known yet. 
With the above ideas in mind, the goal of this chapter is to give a brief introduction 
to adversarial examples and illustrate their applicability to image forensics networks. 
Particular attention is given to the transferability of the examples, since it is mainly 
thanks to this property that adversarial examples can be exploited to develop practical 
attacks against image forensic detectors. We also describe some possible defenses 
against adversarial examples, even if, up to date, a definitive strategy to make the 
creation of adversarial examples impossible, or at least impractical, is not available. 


16.2 Adversarial Examples in a Nutshell 


In this section we briefly review the various approaches proposed so far to generate 
adversarial examples, without referring specifically to a multimedia forensics sce- 
nario. After a rigorous formalization of the problem, we review the main approaches 
proposed so far, then we consider the generation of adversarial examples in the phys- 
ical domain. Eventually, we discuss the problems associated with the generation of 
adversarial attacks in a black-box setting. Throughout the rest of this chapter we will 
always refer to the case of deep neural networks aiming at classifying an input image 
into a number of possible classes, since this is by far the most relevant setting in a 
multimedia forensic context. 
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16.2.1 Problem Definition and Review of the Most Popular 
Attacks 


Let x € [0, 1]” denote a clean image! whose true label is y and let ó be a CNN- 
based image classifier providing a correct classification of x, namely, d(x) = y. 
An adversarial example is a perturbed image x’ = x + ó for which d(x’) Z y. More 
specifically, an untargeted attack aims at generating an adversarial example for which 
g(x’) = y’ for any y' Z y, while for a targeted attack the wrong label y’ must corre- 
spond to a specific target class y,. 

In practice, we require that 6 is a very small quantity, so that the perturbed image 
x’ is perceptually identical to x. Such a goal is usually achieved by constraining the 
Lp norm of ó to be lower than a small positive quantity, that is ||6||, < £, where € is 
a very small value. 

Starting from the seminal work by Szegedy, many approaches to generate the 
adversarial perturbation ó have been proposed, leading to a wide variety of different 
algorithms an attacker can rely on. The proposed approaches include gradient-based 
attacks (Szegedy et al. 2014; Goodfellow et al. 2015; Kurakin et al. 2017; Madry 
et al. 2018; Carlini and Wagner 2017), transfer-based attacks (Dong et al. 2018, 
2019), decision-based attacks (Chen et al. 2017; Ilyas et al. 2018; Li et al. 2019), 
and so on. Most methods assume that the attacker has a full knowledge of the to- 
be-attacked model ¢ a situation usually referred to as a white-box attack scenario. It 
goes without saying that this may not be the case in practical applications for which 
more flexible black-box or gray-box attacks are needed. We will discuss the main 
difficulties associated with black-box attacks in Sect. 16.2.2. 

In the following we review some of the most popular gradient-based white-box 
attacks. 


e L-BFGS Attack. In Szegedy et al. (2014), Szegedy et al. first proposed to generate 
adversarial examples aiming at fooling a CNN-based classifier and minimizing the 
perturbation of the image at the same time. The problem is formulated as 


minimize || ó I 
subject to d(x + ô) = y, | (16.1) 


x +ô € [0, 1)” 


Here, the perturbation introduced by the attack is limited by using the squared 
Lo norm. Solving the above minimization is not an easy task, so the adversarial 
examples are looked for by solving more manageable problem: 


minimize  - B + L(x + ó, yi) 
3 (16.2) 
subject to x + ó € [0, 1]” 


' Hereafter, we assume that image pixels assume values in [0, 1]. For gray-level images n is equal 
to the number of pixels, while for color images n accounts for all the color channels. 
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where L(x + 6, y;) is a loss term forcing the CNN to assign the target label y, 
to x + ó and À is a parameter balancing the importance of two terms. In Szegedy 
et al. (2014), the usual cross-entropy loss is adopted as L(x + ó, y,). The latter 
optimization problem is solved by the box-constraint L-BFGS method. Moreover, 
the optimal value of A is determined by conducting a binary search within a pre- 
defined range of values. 

Fast Gradient Sign Method (FGSM). A drawback of the L-BFGS method is its 
computational complexity. For this reason, Goodfellow et al. (2015) proposed a 
faster attack. The new method starts from the observation that, for small values 
of the perturbation, the output of the CNN can be computed by using a linear 
approximation, and that, due to the large dimensionality of x, a small perturbation 
of the input in the direction of gradient can result in a large modification of the 
output. Accordingly, the adversarial examples are generated by adding a small 
perturbation to the clean image as follows: 


x’ =x+e-sign(V,L(x, y)) (16.3) 


where V, is the gradient of £ with respect to x, and the adversarial perturba- 
tion is constrained by ||d||,, < £. Using the sign of the gradient, ensures that the 
perturbation is within the allowed limits. Since the adversarial examples are gen- 
erated based on the gradient of the loss function with respect to the input image x, 
they can be efficiently calculated by using the back-propagation algorithm, thus 
resulting in a very efficient attack. 

Iterative FGSM (I-FGSM). The FGSM attack is a one-step attack whose strength 
can be modulated only by increasing the value of £, with the risk that the linearity 
approximation the method relies on is no more valid. In Kurakin et al. (2017), the 
one-step FGSM is extended to an iterative, multi-step, method by applying FGSM 
multiple times with a small step size, each time by recomputing the gradient. The 
perturbed pixels are then clipped in an e—neighborhood of the original input, to 
ensure that the distortion constraint is satisfied. The N-th step of the resulting 
method, referred to as I-FGSM (or sometimes BIM—Basic Iterative Method), is 
defined by 


Xo =X; x N = X + clip.{a - sign(V,L(xy_1, y))} (16.4) 


where a is a small step size used to update the attack at each iteration. Similarly to I- 
FGSM, a universal first-order adversary, called projected gradient descent (PGD), 
has been proposed in Madry et al. (2018). Compared with I-FGSM, in which the 
starting point of the iterations is exactly the input image, the starting point of PGD 
is obtained by randomly projecting the input image within an allowed perturbation 
ball, which can be done by adding small random perturbations to the input image. 
Then, the following iterations are conducted in the same way as I-FGSM. 

Jacobian-based Saliency Map Attack (JSMA). According to the JSMA algo- 
rithm proposed by Papernot et al. in (2016), the adversarial examples are generated 
on the basis of an adversarial saliency map, which indicates the contribution of the 
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image pixels to the classification result. In particular, the saliency map is computed 
based on the forward propagation of the pixel values to the output of the to-be- 
attacked network. According to the saliency map, at each iteration, a pair of pixels 
whose modification leads to a significant increase of the output values assigned 
to the target class and/or a decrease of the values assigned to the other classes are 
perturbed by an amount 0. The iterations stop when the perturbed image is classi- 
fied as belonging to the target class or a maximum distortion level is reached. To 
keep the perturbation imperceptible, each pixel can be modified up to a maximum 
of T times. 

Carlini & Wagner Attack (C&W). In Carlini and Wagner (2017), Carlini and 
Wagner proposed to solve the optimization problem in (16.2) in a more efficient 
way. Specifically, two modifications of the basic optimization algorithm are pro- 
posed. On one hand, the loss term is replaced by a properly designed function 
directly related to the maximum difference between the logit of the target class 
and those of the other classes. On the other hand, the box-constraint used to 
keep the perturbed pixels in a valid range is automatically satisfied by letting 
ô = z (tanh(w) + 1) — x, where w is a new variable used for the transformation 
of ó. In this way, the box-constraint optimization problem in (16.2) is transformed 
into the following unconstrained problem: 


minimize | 2 (tanh(w) +1)— x| +à. gG (tanh(w) + 1)) (16.5) 
where the function g is defined as 


g(x") = max(max zi(x') — z,(x’), —K), (16.6) 


and where z; is the logit corresponding to the i—th class, and the parameter k > 0 
is used to encourage the attack to generate adversarial examples classified with 
high confidence. Finally, the adversarial examples are generated by solving the 
modified optimization problem using the Adam optimizer. 


16.2.2 Adversarial Examples in the Physical Domain 


The algorithms described in the previous sections work in the digital domain, since 
they directly modify the digital images fed to the CNN. In many applications, how- 
ever, it is required that the adversarial examples are generated in the physical domain, 
producing real-world objects that, when sensed by the sensors of the attacked system 
and fed to the CNN, cause a misclassification error. In this setting, and by focusing 
on the case of still images, which is the most relevant case for multimedia foren- 
sics applications, the images with the adversarial examples must first be printed or 
displayed on a screen and then shown to a camera, which will digitize them again 
before feeding them to the CNN classifier. This process, sometimes referred to as 
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Fig. 16.1 Physical domain and digital domain attacks 


image rebroadcast, involves a digital-to-analog and an analog-to-digital conversions 
that may degrade the adversarial perturbation thus making the attack ineffective. The 
rebroadcast process involved by a physical domain attack is exemplified in Fig. 16.1. 

As shown in the figure, during the rebroadcast process, the image is affected by 
several forms of degradation including noise addition, light changes and geometric 
transformations. If no countermeasure is taken, it is very likely that the invisible 
perturbations introduce to generate the adversarial examples are damaged up to a 
point to make the attack ineffective. 

Recognizing the above difficulties, several methods have been proposed to gener- 
ate physical domain adversarial examples. Kurakin et al. first brought up this concept 
in Kurakin et al. (2017). In this paper, the new images taken after rebroadcasting are 
geometrically calibrated before being fed to the classifier. In this way, the adversar- 
ial perturbation is not affected by any geometric distortion, hence easing the attack. 
More realistic scenarios are considered in later works. In the seminal paper by Atha- 
lye et al. (2018) a method is proposed to generate physical adversarial examples 
that are robust to various distortions, including changes of the viewing angle and 
distance. Specifically, a procedure, named Expectation Over Transformation (EOT), 
is applied to generate the adversarial examples by ensuring that they are effective 
across a number of possible transformations. 

Let x be the to-be-attacked image, with ground truth label y, and let y; be target 
class of the attack. Without loss of generality, let the output of the classifier be 
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obtained by computing the probability P(y;|x) that the input image x belongs to 
class y; and then choosing for the class resulting in the maximum probability,” 
namely, 

(x) = arg max P (y; |x). (16.7) 


The generation of adversarial examples is formalized as an optimization problem 
that maximizes the likelihood of the target class y, over an €-radius ball around the 
original image x, i.e., 


arg max log P(x; |x + ó) (16.8) 
subject to ||d||, < €. (16.9) 


To ensure robustness of the attack, EOT considers a set T of transformations 
(usually consisting of geometric transformations) and a probability distribution Pr 
over T, and maximizes the likelihood of the target class averaged over a transformed 
version of the input images, generated according to Pr. A similar strategy is applied 
to the constraint, i.e., EOT constrains the expected distance between the transformed 
versions of the adversarial example and the original image. As a result, with EOT 
the optimization problem is reformulated as 


arg max Ep, [log P(y,|t (x + ó))] (16.10) 


subject to Ep, [||t(x + ó) — tœ)llp] (16.11) 


where t is a transformation in T and Ep, denotes expectation over T. The basic 
EOT approach has been extended in many ways, including a wider variety of trans- 
formations and by optimizing the perturbation over more than a single image. For 
instance, Sharif et al. (2016) have proposed a way to generate adversarial examples 
in the physical domain for face recognition systems. They restrict the perturbation 
within a small area around the eye-glass frame and increase the robustness of the 
perturbations by optimizing them over a set of properly chosen images. Eykholt 
et al. (2018) have proposed an attack capable of generating adversarial examples 
under even more realistic conditions. The attack targets a road sign classification 
system possibly deployed on board of autonomous vehicles. In addition, to sample 
the images from synthetic transformations as done by the basic EOT attack, they also 
generated experimental data containing actual physical condition variability. Their 
experiments show that the generated adversarial examples are so effective that can 
be printed on stickers and applied to road signs, and fool a vehicular traffic sign 
recognition system in driving-by tests. 

Zhang et al. (2020) further enlarged the set of transformations to implement a phys- 
ical domain attack against a face authentication system equipped with a spoofing- 


2 This is the common case of CNN classifiers adopting one-hot-encoding to represent the result of 
the classification. 
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detection module, whose aim is to distinguish real faces and rebroadcast face photos. 
This situation presents an additional challenge, as explained in the following. To 
attack the face authentication system in the physical world, the attacker must present 
the perturbed images in front of the system by printing them or showing them on 
a screen. The rebroadcasting procedure, then, introduces new physical traces into 
the image fed to the classifier, which, having been trained to recognize rebroadcast 
traces, can recognize them and classify the images correctly despite the presence of 
the adversarial perturbation. To exit this apparent deadlock, the authors propose to 
utilize the EOT framework to design an attack that pre-emptively takes into account 
the new traces introduced by the rebroadcast procedure. The experiments shown in 
Zhang et al. (2020) demonstrate that EOT effectively allows to carry out an adver- 
sarial attack even in such difficult conditions. 


16.2.3 White Versus Black-Box Attacks 


An adversary may have different levels of knowledge about the algorithm used by the 
forensic analyst. From this point of view, adversarial examples in DL are commonly 
classified as white-box and black-box attacks (Yuan et al. 2019). In a white-box 
scenario, the attacker knows everything about the target forensic classifier, that is, 
he knows all the details of %(:). In a black-box scenario, instead, the attacker knows 
nothing about it, or, has only a very limited knowledge. In some works, following the 
taxonomy in Zheng and Hong (2018), the class of gray-box attacks is also considered, 
referring to a situation wherein the attacker has a partial knowledge of the classifier, 
for instance, he knows the architecture of the classifier, but has no knowledge of its 
internal parameters (e.g., he does not know some of the hyper-parameters or, more 
in general, he does not know the training data D,, used to train the classifier). 

In the white-box scenario, the attacker is also assumed to know the defence strategy 
adopted by the forensic analyst #(-), hence he can take into account the presence 
of such a defence during the attack. Having full access to @(-), the most common 
white-box attacks rely on some form of gradient-descent (Goodfellow et al. 2015; 
Kurakin et al. 2017; Dong et al. 2018), or directly solve the optimization problem 
underlying the attack as in Carlini and Wagner (2017); Chen et al. (2018). 

When the attacker has a partial knowledge of the algorithm (gray-box setting), 
he can build its own version of the classifier ¢’(-), usually referred to as substitute 
or surrogate classifier, providing a hopefully good approximation of ¢(-), and then 
use ¢'(-) to carry out the attack. Hence, the attacker implements a white-box attack 
against $'(-), hoping that the attack will also work against ¢(-) (attack transferability, 
see Sect. (16.3.1)). Mathematically speaking, the difference between white-box and 
black-box attacks is the loss function that the adversary seeks to maximize, namely 
L(ġ(-), -) in the white-box case and £(¢’(-), -) in the black-box case. As long as an 
attack thought to work against #(-) can be transferred to ¢’(-), adversarial samples 
crafted for ¢’(-) will also be misclassified by #(-) (Papernot et al. 2016). The substitute 
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classifier Z'(-) is built by exploiting the available information and making an educated 
guess about the unknown parameters. 

Black-box attacks often assume that the target network can be queried as an 
oracle, for a limited number of times, to get useful information to build the substitute 
classifier (Liu et al. 2016). When the score values or the logit values z; can be 
obtained by querying the target model (score-based black-box attacks), the gradient 
can be estimated by drawing some random samples and acquiring the corresponding 
loss values (Ilyas et al. 2018). A more challenging scenario is when only hard-label 
predictions, i.e., the predicted classes, are returned by the queried model (Brendel 
et al. 2017). 

In the DL literature, several approaches have been proposed to craft adversarial 
examples against substitute models, in such a way to maximize the probability that 
they can be transferred to the original model. One approach is the translation-invariant 
(Dong et al. 2019) attack method, that, instead of optimizing the objective function 
directly on the perturbed image, optimizes it over a set of shifted versions of the 
image. Another closely related approach improves the transferability of adversarial 
examples by creating diverse input patterns (Xie et al. 2019). Instead of only using 
the original images to generate adversarial examples, the method applies image some 
transformations to the inputs with a given probability at each iteration of the gradient- 
based attack. The transformations include resizing and padding. 

A general and simple method to increase the transferability of adversarial exam- 
ples is described in Sect. 16.3.2. 


16.3 Adversarial Examples in Multimedia Forensics 


Given the recent trend toward the use of deep learning architectures in Multimedia 
Forensics, several Counter-Forensic (CF) attacks against deep learning models have 
been developed in the last years. In this case, one of the main advantages of CNN- 
based classifiers and detectors, namely, their ability to learn forensic features directly 
from the input data (be their images, videos, or audio tracks), turns into a weakness. 
An informed attacker can in fact generate his attacks directly in the sample domain 
without the need to map them from the feature domain to the sample domain, as it 
happens with conventional methods based on hand-crafted features. 

The vulnerability of multimedia forensics tools based on deep learning has been 
addressed by many works as summarized in Marra et al. (2018). The underlying 
motivations for the vulnerability of CNNs used in multimedia forensics are basically 
the same characterizing image recognition applications, with a prominent role played 
by the huge dimension of the space of possible inputs, which is substantially larger 
than the set of images used to train the model. As a result, in a white-box scenario, the 
attacker can easily create slightly modified images that fall into an ‘unseen’ region 
of the image space thus forcing a misclassification error. Such a weakness of deep 
learning-based approaches represents a serious threat in multimedia forensics where 
the tools are designed to work under intrinsically adversarial conditions. 
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Adversarial examples that have been developed against CNNSs in multimedia 
forensics are reviewed in the following. 

The first targeted attack based on adversarial examples was proposed in Guera 
et al. (2017) to fool a CNN-based camera model identification system. By relying 
on adversarial examples, the authors propose a counter-forensic method for slightly 
altering an image in such a way to change the estimated camera model. A state- 
of-the-art CNN-based camera model detector based on the DenseNet architecture 
(Huang et al. 2017) is considered in the analysis. The counter-forensic method uses 
both the FGSM attack and the JSMA attack to craft adversarial images obtaining 
good performance in both cases. 

Adversarial attacks against CNN-based image forensics methods have been 
derived in Gragnaniello et al. (2018) for several common manipulation detection 
tasks. The following common processing has been considered: image blurring, 
applied with different variance of the Gaussian filter; image resizing, both down- 
scaling and upscaling; JPEG compression, with different qualities; and median fil- 
tering detection, with window sizes 3 x 3 and7 x 7. Several CNN models have been 
successfully attacked by means of adversarial perturbations obtained via the FGSM 
algorithm applied with different strengths =. 

Adversarial perturbations have been shown to offer poor robustness to image 
processing operations, e.g., image compression (Marra et al. 2018). In particular, 
rounding to integers is sometimes sufficient to wash out the perturbation and make an 
adversarial example ineffective. A gradient-inspired pixel domain attack to generate 
adversarial examples against CNN-based forensic detectors in the integer domain 
has been proposed in Tondi (2018). In contrast to standard gradient-based attacks, 
which perturb the attacked image based on the gradient of the loss function w.r.t. 
the input, in the attack proposed in Tondi (2018), the gradient of the output score 
function is approximated with respect to the pixel values, incorporating the constraint 
that the input is an integer in the optimization. The performance of the integer-based 
attack has been assessed against three image manipulation detectors, for median 
filtering, resizing, and contrast adjustment. Moreover, while common gradient-based 
adversarial attacks, like FGSM, I-FGSM, C&W, etc. ..., work in a white-box setting, 
the attack proposed in Tondi (2018) is a black-box one, since it does not need any 
knowledge of the internal parameters of the model, requiring only that the network 
can be queried as an oracle and the output observed. 

All the above attacks are targeted to a specific CNN classifier. Carrying out an 
effective attack in a completely black-box or no-knowledge scenario, that is, gen- 
erating an attack that can be effectively transferred to the (unknown) model used 
by the analyst, turns out to be a difficult task, that is exacerbated by the complex- 
ity of the decision functions learnt by neural networks. For this reason, and given 
that white-box attacks can be hardly applied in practical applications, studying the 
transferability of adversarial examples plays a major role to assess the security of 
DL-based multimedia forensics tools. 
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16.3.1 Transferability of Adversarial Examples in 
Multimedia Forensics 


It has been demonstrated in several studies concerning computer vision applications 
(Tramér et al. 2017; Papernot et al. 2016) that, in a gray-box or black-box scenario, 
adversarial examples crafted to attack a given CNN model are also effective against 
other models designed for the same task. In this section, we focus on the transfer- 
ability of adversarial examples in an image forensics context and review the main 
results reported so far regarding the transferability of adversarial examples in image 
forensics. 

Several works have investigated the transferability of adversarial examples in a 
network mismatch scenario (Marra et al. 2018; Gragnaniello et al. 2018). Focusing 
on camera model identification, in Marra et al. (2018), Marra et al. analyzed the 
transferability of adversarial examples generated by means of FGSM among differ- 
ent networks. Four different network architectures for camera model identification 
trained on the VISION dataset (Shullani et al. 2017) are utilized for experiments, i.e., 
Shallow CNN (Bondi et al. 2017), DenseNet-40 (Huang et al. 2017), DenseNet-121 
(Huang et al. 2017), and XceptionNet (Chollet 2017). The classification accuracy 
achieved by these models is 80.77% for Shallow CNN, 87.96% for DenseNet-40, 
93.88% for DenseNet-121, and 95.15% for XceptionNet. The transferability perfor- 
mance for the case of cross-network mismatch is shown in Table 16.1. As it can be 
seen, the classification accuracy on attacked images remains pretty high when the 
network used to build the adversarial examples is not the same targeted by the attack, 
thus proving a poor transferability of the attacks. 

A more comprehensive investigation of the transferability of adversarial examples 
in different mismatch scenarios has been carried out in Barni et al. (2019). As in Barni 
et al. (2019), in the following we report the transferability of adversarial examples 
in various settings, according to the kind of attack used to create the examples and 
the mismatch existing between the source and the target models. 


Table 16.1 Classification accuracy (%) for cross-network mismatch in the context of camera model 
identification. The FGSM attack is used 


Source network 


Shallow CNN | DenseNet-40 | DenseNet-121 | XceptionNet 
TN Shallow CNN | 2.19 40.87 42.13 42.51 
DenseNet-40 | 37.23 3.88 37.13 53.42 
DenseNet-121 | 62.81 60.60 4.67 44.35 
XceptionNet | 68.19 79.80 57.21 2.63 


446 M. Barni et al. 
Type of attacks 


We consider the transferability of attacks generated according to two of the most pop- 
ular general attacks proposed so far, namely I-FGSM and JSMA (see Sect. 16.2.1 for 
a detailed description of such attacks). The attacks were implemented by relying on 
the Foolbox package (Rauber et al. 2017). 


Source/target model mismatch 


To carry out a comprehensive evaluation of the transferability of adversarial exam- 
ples, three different mismatch situations between the network used to create the adver- 
sarial examples (hereafter referred to as source network—SN) and the network the 
adversarial examples should be transferred to (hereafter named target network—TN) 
are considered. Specifically, we consider the following situations: (i) cross-network 
mismatch, wherein different network architectures are trained on the same dataset; 
(ii) cross-training-set mismatch, according to which the same network architecture 
is trained on different datasets; and (iii) cross-network-and-training-set mismatch, 
wherein different architectures are trained on different datasets. 


Network architectures 


For cross-network mismatch, we consider BSnet (Bayar and Stamm 2016), designed 
for the detection of a number of widely used image manipulations, and BC+net (Barni 
et al. 2018) proposed to detect generic contrast adjustment operations. In particular, 
BSnet consists of 3 convolutional layers, 3 max-pooling layers, and 3 fully-connected 
layers, and the filters used in the first convolutional layer are constrained to extract 
residual-based features from images. As for BC+net, it has 9 convolutional layers, 
and no constraint is applied to the filters used in the network. The results reported in 
the following refer to two common image manipulations, namely, median filtering 
(with window size 5 x 5) and image resizing (by a factor of 0.8). 


Datasets 


For cross-training-set mismatch, we consider two datasets: the RAISE (R) dataset 
(Dang-Nguyen et al. 2021) and the VISION (V) dataset (Shullani et al. 2017). More 
in details, RAISE consists of 8156 high-resolution uncompressed images with size 
4288 x 2848 in RAISE, 2000 of which were used for the experiments. With regard 
to the VISION dataset, it contains 11,732 JPEG images, with 2000 images used 
for experiments, with a minimum resolution equal to 2336 x 4160 and a maximum 
one of 3480 x 4640. Results refer to JPEG images compressed with a quality factor 
larger than 97. The images were split into training, validation, and test sets without 
overlap. All color images were first converted to gray-scale. 
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Table 16.2 Detection accuracy (%) of the models in the absence of attacks 


Model Median filtering Image resizing 
PBsnet PBsnet PBCanet PBsnet PBsnet PBC4net 
Accuracy | 98.1 99.5 98.4 97.5 96.6 98.5 


Experimental methodology 


Before evaluating the transferability of adversarial examples, the detection models 
were first trained. The input patch size was set to 128 x 128. To train the BSnet 
models, 200,000 patches per class were considered while 10,000 patches were used 
for testing. As to BC+net models, 10° patches were used for training, 10° for vali- 
dation, and 5 x 10* for testing. The Adam optimizer was used with a learning rate 
of 1074 for both networks, and the training batch size was set to 32 patches. BSnet 
was trained for 30 epochs and BC+net for 3 epochs. The detection accuracies of 
the models are given in Table 16.2. For sake of clarity, the symbol #28 is used to 
denote the trained models, where net € {BSnet, BC+net} corresponds to the network 
architecture, and DB € (R, V} is the dataset used for training. 

In counter-forensic applications, it is reasonable to assume that the attacker is 
interested only in passing off the manipulated images as original ones to hide the 
traces of manipulations. Therefore, for all the experiments, 500 manipulated images 
were attacked to generate the adversarial examples. 

For I-FGSM, the number of iterations was set to 10, and the optimal attack strength 
used in each iteration was determined by spanning the range [0 : =, : 0.1], where =, 
was set to 0.01 and 0.001 in different cases. As for JSMA, the perturbation strength 
0 was set to 0.01 and 0.1. For each attack, two different attack strengths were con- 
sidered. The average PSNR of the adversarial examples is always larger than 40 dB. 


Transferability results 


For each setting, we show the average PSNR of the adversarial examples, and the 
attack success rate on the target network, ASR+s, which corresponds to the trans- 
ferability degree (the attack success rate on the source network—ASRsn—is always 
close to 100%, hence it is not reported in the tables). 

Table 16.3 reports the results regarding cross-network transferability. All models 
were trained on the RAISE dataset. In most of the cases, the adversarial examples 
are not transferable, indicating a likely failure of the attacks in a black-box scenario. 
The only exception is median filtering when the attack is carried out by I-FGSM, for 
which we have ASRry = 82.5% with an average PSNR of 40 dB. 

With regard to cross-training-set transferability, the BSnet network was trained 
on the RAISE and the VISION datasets, obtaining the results shown in Table 16.4. 
For the cases of median filtering, the values of ASR+s are relatively high for -FGSM 
with =, = 0.01, while the transferability is poor with the other attacks. With regard 
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Table 16.3 Attack transferability for cross-network mismatch 
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= øR — AR 
SN = PBsnet* TN= PBC4net 


Attack Parameter Median filtering Image resizing 
ASRrTN % PSNR (dB) ASR In % PSNR (dB) 
I-FGSM Es = 0.01 82.5 40.0 0.2 40.0 
€s = 0.001 18.1 59.7 0.2 58.5 
JSMA 0=0.1 1.0 49.6 1.6 46.1 
0 = 0.01 1.6 58.5 0.6 55.0 


Table 16.4 Transferability of adversarial examples for cross-training-set mismatch 


= AR s= AV. 
SN = PBsnet* TN= PBsnet 


Attack Parameter Median filtering Image resizing 
ASRtn % PSNR (dB) ASRrTN % PSNR (dB) 
I-FGSM és = 0.01 84.5 40.0 69.2 40.0 
€s = 0.001 4.5 59.7 4.9 58.5 
JSMA 0=0.1 1.2 49.6 78.2 46.0 
0 = 0.01 0.2 58.5 11.5 55.0 
SN = PBSner TN= PBsnet 
Attack Parameter Median filtering Image resizing 
ASR tn % PSNR (dB) ASR In % PSNR (dB) 
I-FGSM és = 0.01 94.1 40.0 0.2 40.0 
és = 0.001 hal 59.9 0.0 59.6 
JSMA 0=0.1 1.0 49.6 0.0 50.6 
0 = 0.01 0.8 58.1 0.0 57.8 


to image resizing, when the source network is trained on RAISE, the ASR+sN reaches 
a high level when the stronger attacks are employed (IFGSM with £s = 0.01, and 
JSMA with 0 = 0.1). However, the adversarial examples are never transferable when 
the SN is trained on VISION, showing that the transferability between models trained 
on different datasets is not symmetric. 

The results for the case of strongest mismatch (cross-network-and-training-set) 
are illustrated in Table 16.5. Only the case of median filtering attacked by I-FGSM 
with =, = 0.01 achieves a large transferability, while for all the other cases the trans- 
ferability is nearly null. 

In summary, in contrast to computer vision applications, adversarial examples 
targeting image forensics networks are generally non-transferable. Accordingly, from 
the attacker’s side, it is important to investigate if the attack transferability can be 
improved by increasing the attack strength used to generate adversarial examples 
lying deeper inside the target region of the attack. 
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Table 16.5 Attacks transferability for cross-network-and-training-set mismatch 
= V = AR 
SN = PBsnet? TN= PBC4net 


Attack Parameter Median filtering Image resizing 
ASR tn % PSNR (dB) ASR In % PSNR (dB) 
I-FGSM és = 0.01 79.6 40.0 0.4 40.0 
és = 0.001 0.8 59.9 0.2 59.6 
JSMA 0=0.1 0.8 49.6 0.0 50.2 
0 = 0.01 1.2 58.1 0.0 57.4 


16.3.2 Increased-Confidence Adversarial Examples with 
Improved Transferability 


The lack of transferability of adversarial examples in image forensic applications is 
partly due to the fact that, in most implementations, the attacks aim at generating 
adversarial examples with minimum distortion. Consequently, the resulting adver- 
sarial examples are close to the decision boundary and a small difference between 
the decision boundaries of the source and target networks can lead to the failure of 
the attack on the TN. To overcome this problem, the attacker may want to generate 
adversarial examples lying deeper into the target region of the attack. However, given 
the complexity of the decision boundary learned by CNNs, it is hard to control the 
distance of the adversarial examples from the boundary. In addition, increasing the 
attack strength by simply going on with the attack iterations until a limit value of 
PSNR is reached may not be effective, since a lower PSNR does not necessarily 
result in a stronger attack with higher transferability (see Table 16.3). 

In order to generate adversarial examples with higher transferability that can be 
used for counter-forensics applications, a general strategy consists in increasing the 
confidence of the misclassification, where the confidence is defined as the maximum 
difference between the logit of the target class and those of the classes (Li et al. 
2020). Specifically, by focusing on the binary case, for a clean image x with label 
y =i (i = Q, 1), a perturbed image x’ is looked for which z1_; — z; > c, where 
c > 0 is the desired confidence value. By implementing the above stop condition, 
the most popular attacks (such as I-FGSM and PGD) can be modified to generate 
adversarial examples with increased confidence. 

In the following, we evaluate the transferability of increased-confidence adver- 
sarial examples by adopting a methodology similar to that used in Sect. 16.3.1. 


Attacks 


We report the results obtained by applying the increased-confidence strategy to four 
popular attacks, namely, I-FGSM, Momentum-based I-FGSM (MI-FGSM) (Dong 
etal. 2018), PGD, and C&W attack. In particular, the MI-FGSM is a method proposed 
explicitly to improve the transferability of the attacks by mitigating the momentum 
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term in each iteration of the I-FGSM with a decay factor p. 
Networks 


Three network architectures are considered in the cross-network experiments, that is: 
BSnet (Bayar and Stamm 2016), BC+net (Barni et al. 2018), and VGG-16 network 
(VGGnet) commonly used in computer vision applications (VGGnet consists of 13 
convolutional layers and 3 fully connected layers, more information can be found 
in Simonyan and Zisserman (2015)). With regard to the image manipulation tasks, 
we report results referring to median filtering, image resizing, and addition of white 
Gaussian noise (AWGN) with zero mean and unitary standard deviation. 


Datasets 


The experiments on cross-training-set mismatch are based on the RAISE and VISION 
datasets. 


Experimental setup 


BSnet and BC+net were trained by using the same setting described in Sect. 16.3.1. 
With regard to VGGnet, 10° patches were used for training and validation, and 104 
patches were used for testing. The models were trained for 50 epochs by using the 
Adam optimizer with a learning rate of 1075. The range of detection accuracies in 
the absence of attacks are [98.1, 99.5%] for median filtering detection, [96.6, 99.0%] 
for image resizing, and [98.3, 99.9%] for AWGN detection. 

The attacks were applied to 500 manipulated images for each task. The Fool- 
box toolbox (Rauber et al. 2017) was employed to generate increased-confidence 
adversarial examples with the new stop condition. Specifically, for C&W attack, 
all the parameters were set to their default values. For PGD, the binary search was 
conducted with initial values of € = 0.3 and a = 0.005, for 100 iterations. For I- 
FGSM, we also applied 100 steps, with the optimal stepsize determined in the range 
[0.001 : 0.001 : 0.1]. Eventually, for MI-FGSM, all the parameters were set to the 
same values of I-FGSM except the decay factor, for which we let u = 0.2. 


Transferability results 


In the following we show some results demonstrating the improved transferability 
of adversarial examples with increased confidence. For each attack, we report the 
ASRrs and the average PSNR. The ASRsn is always close to 100% and hence it is 
omitted in the tables. For the confidence value c that depends on the logits of the SN, 
different values were chosen for different networks. 

Table 16.6 reports the results for cross-network mismatch, showing the transfer- 
ability when (SN,TN) = (Roanet: Pksnet) for three different image manipulations. 
As expected, by using an increased confidence, the transferability of the adversarial 
examples improves significantly in all these cases. Specifically, the ASR+sx reaches a 
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Table 16.6 Attacks transferability for cross-network mismatch (SN = QV GGnet TN = PRonet)* The 
ASR tn (%) and average PSNR (dB) of adversarial examples is reported 


Median filtering 


MI-FGSM 
ASRtn_ | PSNR ASRtn |PSNR | ASRrn (PSNR | ASRtn | PSNR 
21.2 58.1 
Image resizing 
c I-FGSM MI-FGSM PGD C&W 
ASRrN | PSNR ASRtn I PSNR | ASRrn |PSNR | ASRrn_ | PSNR 
0 0.4 59.3 0.4 59.2 1.8 75.4 1.2 71.5 
17 25.0 33.3 24.8 33.3 22.0 33.4 40.4 36.8 
18 39.6 30.9 37.4 30.9 39.8 30.9 53.6 34.3 
19 51.6 28.9 52.6 28.9 52.0 28.9 64.4 32.2 
AWGN 
c LFGSM MI-FGSM PGD C&W 
ASRrN | PSNR ASRrN | PSNR ASRrN | PSNR ASRrN | PSNR 
0 10.6 56.2 10.8 56.0 5.8 60.6 1.2 64.2 
10 40.4 52.1 41.4 51.9 20.6 54.8 19.2 57.8 
15 79.4 49.4 7718 49.2 50.8 51.4 52.0 53.8 
20 91.0 45.4 92.2 45.4 82.8 46.8 79.9 49.5 


maximum of 96.6, 64.4, and 92.2% for median filtering, image resizing, and AWGN, 
respectively. Moreover, a larger confidence always results in adversarial examples 
with higher transferability and larger image distortion, and different degrees of trans- 
ferability are obtained on different attacks. Among the four attacks, the C&W attack 
always achieves the highest PSNR for similar values of ASRqn. A noticeable observa- 
tion regards the MI-FGSM attack. While (Dong et al. 2018) reports that MI-FGSM 
attack improves the transferability of attacks against computer vision networks, a 
similar improvement of ASR+s is not observed in Table 16.6 (w.r.t. -FGSM). The 
reason could be that the gradients between subsequent iterations are highly correlated 
in image forensics models, and thus the advantage of gradient stabilization sought 
for by MI-FGSM is reduced. 

Considering cross-training-set transferability, the results for the cases of median 
filtering, image resizing, and AWGN addition are shown in Table 16.7. The transfer- 
ability of adversarial examples from BSnet trained on RAISE to the same architec- 
ture trained on VISION is reported. According to the table, increasing the confidence 
always helps to improve the transferability of the adversarial examples. Moreover, 
although the transferability is improved in all the cases, transferring adversarial exam- 
ples for the case of image resizing detection still tends to be more difficult, as a larger 
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Table 16.7 Attacks transferability for cross-training-set mismatch (SN: ÓEsnet TN: Óusnet)- The 
ASRrs (%) and the average PSNR (dB) of adversarial examples is reported 


Median filtering 


MI-FGSM 
ASRrN | PSNR | ASRrN I PSNR | ASRrn | PSNR _ ASRrTrN | PSNR 
4.8 59.7 
Image resizing 
c I-FGSM MI-FGSM PGD C&W 
ASRrN |PSNR | ASRrN (PSNR ( ASRrN |PSNR  ASRrTrN | PSNR 
0 12.8 58.7 12.8 58.6 9.8 66.9 9.8 68.3 
30 53.0 48.0 48.6 47.8 38.6 49.6 23.8 52.5 
40 64.0 44.7 60.2 44.6 54.2 45.7 32.8 48.9 
50 67.2 41.7 64.2 41.7 59.8 42.3 39.2 45.9 
AWGN 
c LFGSM MI-FGSM PGD C&W 
ASRrN |PSNR | ASRrn | PSNR | ASRrn | PSNR | ASRtn | PSNR 
0 0.6 57.7 0.6 57.6 0.2 62.4 0.2 65.0 
20 13.8 49.7 15.0 49.6 11.0 52.2 12.2 54.4 
30 54.4 44.8 72.8 45.1 77.2 47.3 78.0 49.4 
40 88.2 41.0 93.0 41.3 94.0 43.0 95.4 45.5 


Table 16.8 Attacks transferability for cross-network-and-training-set mismatch (SN = bNsnet? TN 
= Be net)» including ASR+TsN (%) and average PSNR (dB) of adversarial samples 


Median filtering 


MI-FGSM 
ASRrN | PSNR ASRtn_ | PSNR ASRrN | PSNR ASR In 
2.6 60.0 
85.6 42.4 


PSNR 
70.5 


Image resizing 


c I-FGSM MI-FGSM PGD C&W 

ASRtn_ | PSNR ASRtn_ | PSNR ASRtn_ | PSNR ASRtn_ | PSNR 
0 2.0 59.8 2.0 59.7 0.8 74.7 0.8 73.3 
400 37.4 31.2 33.4 31.1 40.0 31.2 17.0 33.6 


image distortion is needed to achieve a transferability comparable to that obtained 
for median filtering and AWGN addition. 
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Finally we consider the strongest mismatch case, i.e., cross-network-and-training- 
set mismatch, in the case of median filtering and image resizing detection. Only 
the attack transferability with c = 0 and that with largest confidences are reported 
in Table 16.8. A higher transferability is achieved (with a good image quality) for 
the case of median filtering, while adversarial examples targeting image resizing 
detection are less transferable. 

To summarize, in contrast to computer vision applications, using adversarial 
examples for counter-forensics in a black-box or gray-box scenario, requires that 
proper measures are taken to ensure the transferability of the attacks (Li et al. 2020). 
As we have shown by referring to the method proposed in Li et al. (2020), however, 
attack transferability can be ensured in a wide variety of settings, the price to pay 
being a slight increase of the distortion introduced by the attack, thus calling for the 
development of suitable defences. 


16.4 Defenses 


As a response to the threats posed by the existence of adversarial examples and 
by the ease with which they can be crafted, many defence mechanisms have been 
proposed to develop CNN-based forensic methods that can work in adversarial con- 
ditions. Generally speaking, defences can work in a reactive or proactive mannet. 
In the first case, the defence mechanisms are designed to work in conjunction with 
the original CNN algorithm ¢, e.g., by revealing whether an adversarial example is 
being fed to the CNN by means of a dedicated detector, or by mitigating the (pos- 
sibly present) adversarial perturbation applying some input transformations before 
feeding the CNN (this latter approach has been tested in the general DL literature— 
e.g., Xu et al. (2017)—but has not been adopted as a defence strategy in forensic 
applications). The second branch of defences aims at building more secure CNN 
models from scratch or by properly modifying the original models. The large major- 
ity of the methods developed in the ML and DL-based forensic literature belongs to 
this category. Among the most popular approaches, we mention adversarial training, 
multiple classification, and detector randomization. With regard to detector random- 
ization, we refer to methods wherein the detector is built by including within it some 
random elements, thus qualifying this approach as a proactive one. This contrasts 
with randomization techniques commonly adopted in DL literature, like, for instance, 
stochastic activation pruning (Dhillon et al. 2018), that randomly drops some neurons 
of the CNN during the forward pass. 

In the following, we review some of the methods developed in the forensic liter- 
ature to defend against adversarial examples. 
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16.4.1 Detect Then Defend 


The most common defence approach in early anti-counter-forensics works consisted 
in performing adversarial examples detection to rule out adversarial images, followed 
by the application of standard (unaware) detectors for the analysis of the samples 
deemed to be benign. 

The analyst, aware of the counter-forensic method the system may be subject to, 
develops a new detector ġ4 capable to expose the attacked images by looking for 
the traces left by the counter-forensic algorithm. This goal is achieved by resort- 
ing to new, tailored, features. The new algorithm %A is explicitly designed to reveal 
whether the image underwent a certain attack or not. If this is the case, the analyst 
may refuse the image or try to clean it to remove the effect of the attack. In this kind 
of approaches, the algorithm ¢, is used in conjunction with the original, unaware, 
algorithm ¢. Such a view is adopted in Valenzise et al. (2013); Zeng et al. (2014), to 
address the adversarial detection of JPEG compression and median filtering, when 
the attacker tries to hinder the detection by removing the traces of JPEG compression 
and median filtering. Among the examples for the case of model-based analysis, we 
mention the algorithm in Costanzo et al. (2014) against the keypoint removal and 
injection attack against copy-move detectors. The “detect then defend” approach has 
also been adopted recently in the case of CNN-based forensic detectors, to defend 
against adversarial examples. For instance, Carrara et al. (2018) proposes to tell apart 
adversarial examples from benign images by looking at their behavior in the feature 
space. The method focuses on the analysis of the trajectory of the internal repre- 
sentations of the network (i.e., the activation of the neurons of the hidden layers), 
arguing that, for adversarial inputs, the representations follow a different evolution 
with respect to the case of genuine (non-adversarial) inputs. Detection is achieved by 
defining a feature distance to encode these trajectories into a fixed length sequence, 
used to train an Long Short Term Memory (LSTM) neural network to discern adver- 
sarial inputs from genuine ones. 

Modeling normal data distributions for the activations has also been considered in 
the general DL literature to reveal abnormal behavior of adversarial examples (Tao 
et al. 2018). 

It goes without saying that, this kind of defences can be easily circumvented if 
the attacker has some information about the method used by the analyst to expose 
the attack, thus entering a never-ending loop where attacks and forensic algorithms 
are iteratively developed. 


16.4.2 Adversarial Training 


A simple, yet effective, approach to improve the robustness of machine learning 
classifiers against adversarial attacks is adversarial training. Adversarial training 
consists in augmenting the training dataset with examples of adversarial images. 
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Suchan approach implicitly assumes that the attack algorithm is known to the analyst 
and that the attack is not carried out on the retrained version of the detector. This 
identifies adversarial training as a white-box defense carried out against a gray-box 
attack. 

Let DA be the set of attacked images used for training. The adversarial trained 
classifieris ó(:, 7; Da U DA), where T = (ti, to, ...} denotes the set of all the hyper- 
parameters (i.e., the non-trainable parameters) of the network, including, for instance, 
the type of algorithm, its structure, the loss function, the internal parameters, the train- 
ing procedure, etc. ...Adversarial training has been widely adopted in the general DL 
literature to improve the robustness of DL classifiers against adversarial examples 
(Goodfellow et al. 2015; Madry et al. 2018). DL-based forensics is no exception, with 
the proposal of several approaches that resort to adversarial training to defend against 
adversarial examples (see, for instance, (Schottle et al. 2018; Zhao et al. 2018)). In 
machine-learning-based forensic literature, JPEG-aware training is often considered 
to achieve robustness against JPEG compression. The interest in JPEG compression 
is motivated by the fact that JPEG compression is a common post-processing oper- 
ation applied to images, either innocently (e.g., when uploading images on a social 
network), or maliciously (in some cases, in fact, JPEG compression can be consid- 
ered as a simple and effective laundering attack (Barni et al. 2019)). The algorithms 
in Barni et al. (2017, 2016), for double JPEG detection, and in Boroumand and 
Fridrich (2017) for a variety of manipulation detection problems, based on Support 
Vector Machines (SVM), are examples of this approach. Several examples have also 
been proposed more recently in CNN-based forensics, e.g., for camera model iden- 
tification (Marra et al. 2018), contrast adjustment detection (Barni et al. 2018), and 
GAN detection (Barni et al. 2020; Nataraj et al. 2019). 

Resizing is another common processing operator applied to images, either inno- 
cently or maliciously. In a splicing scenario, for instance, image resizing can be 
applied to delete the traces of the splicing operation. Accordingly, a resizing-aware 
detector can be trained to detect splicing in the presence of resizing. From a different 
perspective, resizing can also be used as a defence against adversarial perturbations. 
Like other geometric transformations, resizing can be applied to the input images 
before testing, so to disable the effectiveness of the adversarial perturbations (He 
et al. 2017). 


16.4.3 Detector Randomization 


Detector randomization is another defense strategy that has been proposed to make 
life difficult for the attacker. With regard to computer vision applications, many 
randomization-based approaches have been proposed to hinder the transferability 
of adversarial examples in gray-box or black-box scenarios, and hence improve the 
security of the applications (Xie et al. 2018; Dhillon et al. 2018; Taran et al. 2019; 
Liu et al. 2018). 
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A similar approach can also be employed for securing image forensics detectors. 
The method proposed in Chen et al. (2019) applies randomization in the feature space 
by randomly selecting a subset of features to be used by the subsequent forensic 
detector. Randomization is based on a secret key, unknown to the attacker, who, 
then, cannot gain full knowledge of the to-be-attacked system (thus being forced 
to operate in a gray-box attack scenario). In Chen et al. (2019), the effectiveness 
of random feature selection (RFS) is proven theoretically, under some simplifying 
assumptions, and verified experimentally for two detectors based on support vector 
machines. 

The approach proposed in Chen et al. (2019) has been extended to detectors based 
on deep learning by means of a techniques called Random Deep Feature Selection 
(RDFS) (Barni et al. 2020). In detail, a CNN-based forensic detector is first divided 
into two parts, i.e., the convolutional layers, playing the role of feature extractor, 
and the fully connected layers, to be regarded as the final detector. Let N denote the 
number of features extracted by the convolutional layers. To train a secure detector, 
a subset of K features is randomly selected according to a secret key. The fully 
connected layers are then retrained based on the selected K features. To maintain 
a good classification accuracy, the same model with the same secret key is applied 
during the training and the testing phases. In this case, the convolutional layers are 
only used for feature extraction and are frozen in the retraining phase to keep the 
feature space unchanged. 

Assuming that the attacker has no knowledge of the RDFS strategy, he will tar- 
get the original CNN-based model during his attack and the effectiveness of the 
randomized network will ultimately depend on the transferability of the attack. 

The effectiveness of RDFS has been evaluated in Barni et al. (2020). Specifically, 
the experiments were conducted based on the RAISE dataset (Dang-Nguyen et al. 
2021). Two different network architectures (BSnet (Bayar and Stamm 2016) and 
BC+net (Barni et al. 2018)) are utilized to detect three different image manipulations, 
namely, median filtering, image resizing, and adaptive histogram equalization (CL- 
AHB). 

To build the original models based on BSnet, for each class, a number of 100,000 
patches was considered for training, and 3000 and 10,000 patches were considered 
for validation and testing, respectively. As to the models based on BC+net, 500,000 
patches per class were used for training, 5000 for validation, and 10,000 for testing. 
For both networks, the input patch size was set to 64 x 64, and the batch size was 
set to 32. The Adam optimizer with learning rate of 1074 was used for training. 
The BSnet models were trained for 40 epochs and the BC+net models for 4 epochs. 
For BSnet, the classification accuracy of the models are 98.83, 91.30, and 90.45% 
for median filtering, image resizing, and CL-AHE, respectively. The corresponding 
accuracy values are 99.73, 95.05, and 98.30% for BC+net. 

To build the RDFS-based detectors, first, a subset of the original training set with 
20,000 patches (per class) was randomly selected. Then, for each model, the FC 
layers were retrained based on K features randomly selected from the full feature 
set. The Adam optimizer with learning rate of 1075 is utilized. An early stop condition 
was adopted with a maximum number of 50 epochs, and the training process will 
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stop when the validation loss changes less than 107° within 5 epochs. The number 
of K considered in the experiments are K = {5, 10, 30, 50, 200, 400} as well as the 
full feature with K = N. For each K, 50 models trained on randomly selected K 
features were utilized in the experiments. 

Three popular attacks were applied to generate adversarial examples, namely L- 
BFGS, I-FGSM and PGD. All the attacks were conducted based on Foolbox (Rauber 
et al. 2017). The detailed parameters of the attacks are given below. For L-BFGS, 
the parameters were set as default. -FGSM was conducted with 10 iterations, and 
the best strength was found by spanning the range [0 : 0001 : 0.1]. For PGD, a 
binary search was conducted with initial values of € = 0.3 and a = 0.05, to find the 
optimal parameters. An exception for PGD is that the parameters used for attacking 
the models for the detection of CL-AHE were set as € = 0.01 and a = 0.025 without 
using binary search. 

In the absence of attacks, the classification performance of the models trained on 
K features was evaluated on 4000 patches per class, while the defense performance 
was evaluated based on 500 manipulated images. 

Tables 16.9 and 16.10 show the classification accuracy for the cases of BSnet 
and BC-+net on three different manipulations. The RDFS strategy is helpful to hin- 
der the transferability of adversarial examples on both networks at the expense of a 
slight decrease of the classification accuracy on clean images (in absence of attacks). 
Considering the BSnet, with the decrease of K, the detection accuracy of adversar- 
ial examples increases. For instance, the gain of the accuracy reaches 20-30% for 
K = 30 and 30-50% for K = 10, while the classification accuracy on clean images 
decreases by only 2—4% in these two cases. A similar behavior can be observed 
in the cases of BC+net in Table 16.10. In some cases, the detection accuracy for 
K = N is already pretty high, which means that the adversarial examples built on 
the original CNN cannot transfer to the new detector trained on the full feature set. 
This phenomenon confirms the conclusion that adversarial examples have a poor 
transferability in the context of image forensics. 

Further research has been carried out to evaluate the effectiveness of RDFS against 
increased-confidence adversarial examples (Li et al. 2020). Both the original CNNs 
and the RDFS detectors utilized in Barni et al. (2020) were tested. Two different 
attacks were conducted, namely, C&W attack and PGD, to generate adversarial 
examples with confidence value c. Similarly to Li et al. (2020), different confidence 
values are chosen for different SN. The detection accuracies of the RDFS detectors for 
the median filtering and image resizing task in the presence of increased-confidence 
adversarial examples are shown in Tables 16.11 and 16.12. It can be observed that 
decreasing K also helps improving the defense performance of the RDFS detectors 
against increased-confidence adversarial examples. For example, considering the 
case of BSnet for median filtering detection, for adversarial examples with c = 10 
generated based on C&W attack, by decreasing K from N to 10, the detection 
accuracy is improved by 34.5%. However, with the increase of the confidence, the 
attack tends to be stronger, and the detection accuracy tends to be very low, even for 
small value of K. 


M. Barni et al. 


458 


6°69 919 9°S9 O16 8°6L OTS 0 çL L 988 VL8 C68 0°88 OEL ç 
0°89 L'SS 029 0°56 9°08 SYY TL9 T E6 0°88 168 9°8L O'8L 01 
ç 9ç ver 8'87 06 L6L 80E 19s 8°96 5°68 L'06 L'Y9 108 og 
O'TS OSE Tor t L6 0°08 ov S'ES L'L6 T06 C16 c 9ç L'09 Os 
OTE LSI YLI 8'46 9'LL 8°01 SV L86 9'16 0+6 gr C18 ooz 
LOT TL T6 L'L6 99L çL och 8°86 816 ç r6 0O'Ir E'IS 00r 
Cle 9°0 COT 0°86 618 cr L'6E 0°66 816 6°66 TSE ç'09 N 
ddd} IS541 SD548-1 ON d9d|} WSDAT| SOAd-T ON ddd} INS541 SOAd-T ON J 
Surzisəi Ə8putr SuLI9)[U UIP HHV-TO 
syorne Jo əouəsqg I} 0} spuodsərto2 ON, ‘PUSY uo paseg 1o]99]9p SAG Su JO (%) Aopinoov «GOT ALEL 


459 


16 Adversarial Examples in Image Forensics 


L'L9 L8S L09 VL c 8 © 87 FLL TL6 CLV Leg OLY VL8 S 
6IL 66S 0 c9 99, 1°98 c rr Ç 6L 8°86 9'çç 8°89 cç 87 TI6 Ol 
818 ç S9 LOL LC 6°88 oog 9'6L F 66 LOS COL 86E £6 og 
TSS 8°99 OEL 8°96 VL8 617 9°9OL 9°66 90S 0°08 CSE 1'S6 Os 
0°88 9°69 6 LL L'66 9°88 OLI COL 9°66 C87 0 c8 0'97 6'96 ooz 
£68 SIL 0°08 8°66 188 9ST 9°SL 9°66 Tog 9'c9 01 TL6 00t 
8°68 CSL TIS 001 TSS L'EI cil L66 Cee Ore TOIT T86 N 
ddd} WSDdT} SD548-1 ON d9d|} WSDAT| SOAd-T ON ddd} IS5441 SOAd-T ON J 


SUIZISOI Ə8pu[ 


SuuƏJ[U uerpoyy 


HHV- IO 


syorye Jo əouəsqg ət] 0} spuodsərro5 ,oN, OU+Dg uo paseg 107999P SAG ou Jo (%) Aowns9y 0I'9IƏəllqeIL 


460 M. Barni et al. 


Table 16.11 Detection accuracy (%) of the RDFS detector (based on BSnet) on increased- 
confidence adversarial examples 


Median filtering Image resizing 


c 

K 
K 
K 
K 
K 
K 
K 


Table 16.12 Detection accuracy (%) of the RDFS detector (based on the BC+net) on increased- 
confidence adversarial examples 


Median filtering Image resizing 

c 

K=N 3.3 
K = 400 3.2 
K = 200 4.5 
K = 50 8.7 
K =30 13.5 
K=10 25.2 
K=5 23.0 


16.4.4 Multiple-Classifier Architectures 


Another possibility to combat counter-forensics attacks is to resort to intrinsically 
more secure architectures. This approach has been widely considered in the general 
literature of ML security and in ML-based forensics, however there are only few 
methods resorting to such an approach in the case of CNNs. 

Among the approaches developed for general ML applications we mention 
(Biggio et al. 2015), where a multiple classifier architecture is proposed, referred 
to as a one-and-a-half classifier. The one-and-a-half class architecture consists of 
two-class classifiers and two one-class classifiers run in parallel followed by a final 
one-class classifiers. It has been shown that, when properly trained, this architecture 
can effectively limit the damage of an attacker with perfect knowledge. The one-and- 
a-half architecture has also been considered in forensic applications, to improve the 
security of SVM-based detectors (Barni et al. 2020). In particular, considering sev- 
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eral manipulation detection tasks, the authors of Barni et al. (2020) showed that the 
one-and-a-half class architecture can outperform two-class architectures in terms of 
security against white-box attacks, for a fixed level of robustness. The effectiveness 
of such an approach for CNN-based classifiers has not been investigated yet. 

Another simple possibility to design a more secure classifier consists in building 
an ensemble of individual algorithms using some convenient ensemble strategies, as 
proposed in Strauss et al. (2017) in the general DL literature. A similar approach 
has been consider in Fontani et al. (2014) for ML-based forensics, where the authors 
propose to improve the robustness of the decision by fusing the outputs of several 
forensic algorithms looking for different traces. The method has been assessed for 
the representative forensic task of splicing detection in JPEG images. Even in this 
case, the effectiveness of such an approach for CNN-based forensic applications has 
still to be investigated. 

Other approaches that have been proposed in security-oriented applications for 
general ML-based classification consist in using multiple classifiers in conjunction 
with randomization. Randomness can pertain to the selection of the training samples 
of the individual classifiers, like in Breiman (1996), or it can be associated with the 
selection of the features used by the classifiers, as in Ho (1998). Another strategy 
resorting to multiple classification and randomization has been proposed in Biggio 
et al. (2008) for spam-filtering applications, where the source of randomness is in the 
choice of the weights assigned to the filtering modules of the individual classifiers. 

Regarding the DL literature, an approach that goes in this direction is model 
switching (Wang et al. 2020), according to which a certain number of trained sub- 
models are randomly selected at test time. 

All considered, combining multiple classification with randomization in the 
attempt to improve the security of DL-based forensic classifiers is something that 
has not been studied much and is worth of further investigation. 


16.5 Final Remarks 


Deep learning architectures provide new powerful tools enriching the toolbox avail- 
able to image forensics designers. At the same time, they introduce new vulnera- 
bilities due to some inherent security weaknesses of DL techniques. In this chapter, 
we have reviewed the threats posed by adversarial examples and their possible use 
for counter-forensics purposes. Even if adversarial examples targeting image foren- 
sics networks tend to be less transferable than those created for computer vision 
applications, we have shown that, by properly increasing the strength of the attacks, 
a transferability level which is sufficient for practical applications can be reached. 
We have also presented some possible remedies against attacks based on adversar- 
ial examples, even if a definitive solution capable to prevent such attacks in the 
most challenging white-box scenario has not been found yet. We are convinced that, 
together with the necessity of developing image forensics techniques suitable to be 
applied outside controlled laboratory settings, robustness against intentional attacks 
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is one of the most pressing needs, if we want that image forensics is finally used in 
real-life applications contributing to restore the credibility of digital media. 
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Chapter 17 A) 
Anti-Forensic Attacks Using Generative get 
Adversarial Networks 


Matthew C. Stamm and Xinwei Zhao 


17.1 Introduction 


The rise of deep learning has led to rapid advances in multimedia forensics. Algo- 
rithms based on deep neural networks are able to automatically learn forensic traces, 
detect complex forgeries, and localize falsified content with increasingly greater 
accuracy. At the same time, deep learning has expanded the capabilities of anti- 
forensic attackers. New anti-forensic attacks have emerged, including those discussed 
in Chap. 14 based on adversarial examples, and those based on generative adversarial 
networks (GANSs). 

In this chapter, we discuss the emerging threat posed by GAN-based anti-forensic 
attacks. GANSs are a powerful machine learning framework that can be used to create 
realistic, but completely synthetic data. Researchers have recently shown that anti- 
forensic attacks can be built by using GANS to create synthetic forensic traces. While 
only a small number of GAN-based anti-forensic attacks currently exist, results show 
these early attacks are both effective at fooling forensic algorithms and introduce very 
little distortion into attacked images. Furthermore, by using the GAN framework to 
create new anti-forensic attacks, attackers can dramatically reduce the time required 
to create a new attack. 

For simplicity, we will assume in this chapter that GAN-based anti-forensic attacks 
will be launched against images. This also aligns with the current state of research, 
since all existing attacks are targeted at images. However, the ideas and methodolo- 
gies presented in this chapter can be generalized to a variety of media modalities 
including video and audio. We expect new GAN-based anti-forensic attacks targeted 
on these modalities will likely arise in the near future. 

This chapter begins with Sect. 17.2, which provides a brief background on GANS. 
Section 17.3 provides background on anti-forensic attacks, including how they are 
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traditionally designed as well as their shortcomings. In Sect. 17.4, we discuss how 
GANS are used to create anti-forensic attacks. This gives a high-level overview 
of the components of an anti-forensic GAN, differences between attacks based on 
GANs and adversarial examples, as well as an overview of existing GAN-based 
anti-forensic attacks. Section 17.5 provides more details about how these attacks are 
trained, including different techniques that can be employed based on the attacker’s 
knowledge of and access to the forensic algorithm under attack. We discuss known 
shortcomings of anti-forensic GANSs and future research directions in Sect. 17.6. 


17.22 Background on GANs 


Generative adversarial networks (GANs) are a machine learning framework that 
is used to train generative models (Goodfellow et al. 2014). In their standard form, 
GANs consist of two components: a generator G and a discriminator D. The generator 
and the discriminator can be viewed as competitors in a two-player game. The goal 
of the generator is to learn to produce deceptively synthetic data that can mimic 
the real data, while the goal of the discriminator aims to learn to distinguish the 
synthetic data from the real data. The two parties are trained alternatively and each 
party attempts to improve its performance to defeat the other until the synthetic data 
is indistinguishable from the real data. 

The generator is a generative model that creates synthetic data x’ by mapping some 
input z in a latent space to an output as a set of real data x. Ideally, the generator 
produces synthetic data x’ with the same distribution as real data x such that the two 
cannot be differentiated. The discriminator is a discriminative model that will assign 
a scalar to each input data. The discriminator can output any scalar between 0 and 1, 
with | representing the real data and 0 representing the synthetic data. During training 
phase, the discriminator aims to maximize the probability of assigning correct label 
to both real data x and synthetic data x’, while the generator aims to reduce the 
difference between the discriminator’s decision on the synthetic data and 1 (i.e., 
it must be real data). Training a GAN is equivalent to finding the solution of the 
min-max function 


min max Ex~ py; [log D(X)] + Er slog — D(G(2))] (17.1) 


where fx (x) represents the distribution of real data, f; (z) represents the distribution 
of input noise, E represents the operation of calculating expected value. 
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17.2.1 GANSs for Image Synthesis 


While GANs can be used to synthesize many types of data, one of their most com- 
mon uses is to synthesize images or modify images. The development of several 
specific GANs has provided important insights and guidelines into the design of 
GANs for images. For example, while the GAN framework does not specify the 
specific form of the generator or discriminator, the development of DCGAN showed 
that deep convolutional generators and discriminators yield substantial benefits when 
performing image synthesis (Radford et al. 2015). Additionally, this work suggested 
constraints on the architectures of the generator and discriminator that can result in 
stable training. The creation of Conditional GANs (CGANs) showed the benefit of 
using information such as labels as auxiliary inputs to both the generator and the 
discriminator (Mehdi Mirza and Simon Osindero 2014). This can help can improve 
the visual quality and control the appearance of synthetic images generated by a 
GAN. InfoGAN showed that the latent space can be structured to both select and 
control generation of images (Chen et al. 2016). Pix2Pix (Isola et al. 2018) showed 
that GANs can learn to translate from one image space to another (Isola et al. 2018), 
while StackGAN demonstrated that text can be used to guide the generation of images 
via a series of CGANs (Zhang et al. 2017). 

Since their first appearance in literature, the development of GANS for image 
synthesis has proceeded at an increasingly rapid pace. Research has been performed 
to improve the architectures of the generator and discriminator (Zhang et al. 2019; 
Brock et al. 2018; Karras et al. 2017; Zhu et al. 2017; Karras et al. 2020; Choi 
et al. 2020), improve the training procedure (Arjovsky et al. 2017; Mao et al. 2017), 
and build publicly available datasets to aid the research community (Yu et al. 2016; 
Liu et al. 2015; Karras et al. 2019; Caesar et al. 2018). Because of these efforts, 
GANs have been used to achieve state-of-the-art performance on various computer 
vision tasks, such as super-resolution (Ledig et al. 2017), photo inpainting (Pathak 
et al. 2016), photo blending (Wu et al. 2019), image-to-image translation (Isola et al. 
2018; Zhu et al. 2017; Jin et al. 2017), text-to image-translation (Zhang et al. 2017), 
semantic-image-to-photo translations (Park et al. 2019), face aging (Antipov et al. 
2017), generating human faces (Karras et al. 2017; Choi et al. 2018, 2020; Karras 
et al. 2019, 2020), generating 2-D (Jin et al. 2017; Karras et al. 2017; Brock et al. 
2018) and 3-D objects (Wu et al. 2016), video prediction (Vondrick et al. 2016), and 
many more applications. 


17.3 Brief Overview of Relevant Anti-Forensic Attacks 


In this section, we provide a brief discussion of what anti-forensic attacks are and 
how they are designed. 
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17.3.1 What Are Anti-Forensic Attacks 


Anti-forensic attacks are countermeasures that a media forger can use to fool forensic 
algorithms. They operate by falsifying the traces that forensic algorithms rely upon 
to make classification or detection decisions. A forensic classifier that is targeted by 
an anti-forensic attack is referred to as a victim classifier. 

Just as there are many different forensic algorithms, there are similarly a wide 
variety of anti-forensic attacks (Stamm et al. 2013; Böhme and Kirchner 2013). Many 
of them, however, can be broadly grouped into two categories: 


Attacks designed to disguise evidence of manipulation and editing. Attacks 
have been developed to fool forensic algorithms designed to detect a variety of 
manipulations such as resampling (Kirchner and Bohme 2008; Gloe et al. 2007), 
JPEG compression (Stamm and Liu 2011; Stamm et al. 2010; Fan et al. 2013, 2014; 
Comesana-Alfaro and Pérez-Gonzalez 2013; Pasquini and Boato 2013; Chu et al. 
2015), median filtering (Fontani and Barni 2012; Wu et al. 2013; Dang-Nguyen 
et al. 2013; Fan et al. 2015), contrast enhancement (Cao et al. 2010), and unsharp 
masking (Laijie et al. 2013), as well as algorithms detect forgeries by identifying 
inconsistencies in lateral chromatic aberration (Mayer and Stamm 2015), identify 
copy-move forgeries (Amerini et al. 2013; Costanzo et al. 2014), and detect video 
frame deletion (Stamm et al. 2012a, b; Kang et al. 2016). 

Attacks designed to falsify the source of a multimedia file. Attacks have been 
developed to falsify demosaicing traces associated with an image’s source camera 
model (Kirchner and Böhme 2009; Chen et al. 2017)??, as well as PRNU traces 
that can be linked to a specific device (Lukas et al. 2005; Gloe et al. 2007; Goljan 
et al. 2010; Barni and Tondi 2013; Dirik et al. 2014; Karaküçük and Dirik 2015) 


To understand how anti-forensic attacks operate, it is helpful to first consider 
a simple model of how forensic algorithms operate. Forensic algorithms extract 
forensic traces from an image, then associate those traces with a particular forensic 
class. That class can be associated with editing operation or manipulation that an 
image has undergone, the image’s source (i.e., it’s camera model, its source device, it’s 
distribution channel, etc.,) or some other property that a forensic investigator wishes 
to ascertain. It is important to note that when performing manipulation detection, even 
unaltered images contain forensic traces. While they do not contain traces associated 
with editing, splicing, or falsification, they do contain traces associated with an image 
being generated by a digital camera and not undergoing subsequent processing. 

Anti-forensic attacks create synthetic forensic traces within an image that are 
designed to fool a forensic classifier. This holds true even for attacks designed to 
remove evidence of editing. These attacks synthesize traces in manipulated images 
that are associated with unaltered images. Most anti-forensic attacks are targeted, 
i.e., they are designed to trick the forensic classifier into associating an image with a 
particular target forensic class. For example, attacks against manipulation detectors 
typically specify the class of ‘unaltered’ images as the target class. Attacks designed 
to fool source identification algorithms synthesize traces associated with a target 
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source that the image did not truly originate from. Additionally, a small number of 
anti-forensic attacks are untargeted. This means that the attack does not care which 
class the forensic algorithm associates the image with, so long as it is not the true class. 
Untargeted attacks are more commonly used to fool source identification algorithms, 
since an untargeted attack still tricks the forensic algorithm into believing that an 
image can form an untrue source. They are typically not used against manipulation 
detectors, since a forensic investigator will still decide that image is inauthentic even 
if their algorithm mistakes manipulation A for manipulation B. 


17.3.2 Anti-Forensic Attack Objectives and Requirements 


When creating an anti-forensic attack, an attacker must consider several design objec- 
tives. For an attack to be successful, it must satisfy the following design requirements; 


e Requirement 1: Fool the victim forensic algorithm 
An anti-forensic attack should cause the victim algorithm to classify an image 
as belonging to the attacker’s target class. If the attack is designed to disguise 
manipulation or content falsification, then the target class is the class of unaltered 
images. If the attack is designed to falsify an image’s source, then the target class is 
the desired fake source chosen by the attacker such as a particular source camera. 

e Requirement 2: Introduce no visually perceptible artifacts. 
An attack should not introduce visually distinct telltale signs that a human can 
easily identify. If this occurs, a human will quickly disregard an attacked image 
as fake even if it fools a forensic algorithm. This is particularly important if the 
image is to be used as part of a misinformation campaign, since images that are 
not plausibly real will be quickly flagged. 

e Requirement 3: Maintain high visual quality of the attacked image. 
Similar to Requirement 2, if an attack fools a detector but significantly distorts 
an image, it is not useful to an information attacker. Some distortion is allowable, 
since ideally no one other than the attacker will be able to compare the attacked 
image to its unattacked counterpart. Additionally, this requirement is put in place 
to ensure that an attack does not undo desired manipulations that an attacker has 
previously performed to an image. For example, localized color corrections and 
contrast adjustments may be required to make a falsified image appear visually 
plausible. If an anti-forensic attack reverses these intentional manipulations, then 
the attack is no longer successful. 


In addition to these requirements, there are several other highly desirable proper- 
ties for an attack to possess. These include 
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e Desired Goal 1: Be rapidly deployable in practical scenarios. 
Ideally, an attack should be able to be launched quickly and efficiently, thus 
enabling rapid attacks or attacks at scale. It is flexible enough to attack images of 
any arbitrary size, not only images of a fixed predetermined size. Additionally, it 
should not require prior knowledge of which region of an image will be analyzed, 
such as a specific image block or block grid (it is typically unrealistic to assume 
that an investigator will only examine certain fixed image locations). 

e Desired Goal 2: Achieve attack transferability. 
An attack achieves transferability if it is able to fool other victim classifiers that 
it has not been explicitly designed or trained to fool. This is important because an 
investigator can typically utilize multiple different forensic algorithms to perform 
a particular task. For example, several different algorithms exist for identifying an 
image’s source camera model. An attack designed to fool camera model identifi- 
cation algorithm A is maximally effective if it also transfers to fool camera model 
identification algorithms B and C. 


While it is not necessary that an attack satisfy these additional goals, attacks that 
do are significantly more effective in realistic scenarios. 


17.3.3 Traditional Anti-Forensic Attack Design Procedure 
and Shortcomings 


While anti-forensic attacks target different forensic algorithms, at a high level the 
majority of them operate in a similar manner. First, an attack builds or estimates a 
model of the forensic trace that it wishes to falsify. This could be associated with 
the class of unaltered images in the case that the attack is designed to disguise 
evidence of manipulation or falsification, or it could be associated with a particular 
image source. Next, this model is used to guide a technique that synthesizes forensic 
traces associated with this target class. For example, anti-forensic attacks to falsify 
an image’s source camera model synthesize demosaicing traces or other forensic 
traces associated with a target camera model. Alternatively, techniques designed to 
remove evidence of editing such as multiple JPEG compression or median filtering 
synthesize traces that manipulation detectors associate with unaltered images. 
Traditionally, creating a new anti-forensic attack has required a human expert to 
first design an explicit model of the target trace that the attack wishes to falsify. Next, 
the expert must create an algorithm capable of synthesizing this target trace such that it 
matches their model. While this approach has led to the creation of several successful 
anti-forensic attacks, it is likely not scalable enough to keep pace with the rapid 
development of new forensic algorithms. It is very difficult and time consuming for 
humans to construct explicit models of forensic traces that are accurate enough to for 
an attack to be successful. Furthermore, the forensic community has widely adopted 
the use machine learning and deep learning approaches to learn sophisticated implicit 
models directly from data. In this case, it may be intractable for human experts to 
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explicitly model a target trace. Even if a model can be successfully constructed, a 
new algorithm must be designed to synthesize each trace under attack. Again, this is 
also a challenging and time-consuming process. 

In order to respond to the rapid advancement of forensic technologies, adversaries 
would like to develop some automated means of creating new anti-forensic attacks. 
Tools from deep learning such as convolutional neural networks have allowed foren- 
sics researchers to successfully automate how forensic traces are learned. Similarly, 
intelligent adversaries have begun to look to deep learning for approaches to auto- 
mate the creation of new attacks. As a result, new attacks based upon GANs have 
begun to emerge. 


17.3.4 Anti-Forensic Attacks on Parametric Forensic Models 


WIth the advances of software and apps, it is very easy for people to manipulate 
images the way they want. To make an image deceptively convincing to human eyes, 
post-processing operations such as resampling, blending, denoising are often applied 
to the falsified images. Recent years, deep learning-based algorithms are also used to 
create innovation contents. In the hand of an attacker, these techniques can be used 
for malicious purposes. Previous research shows that manipulations and attacks leave 
traces that can be modeled or characterized by forensic algorithms. While these traces 
may be invisible to human eyes, through forensic algorithms forged images can be 
detected or distinguished from the real images. 

Anti-forensic attacks are countermeasures attempting to fool target forensic algo- 
rithms. Some anti-forensic attacks aim to remove forensic traces left by manipulation 
operations or attacks that the forensic algorithms analysis. Some anti-forensic attacks 
synthesize fake traces to make the forensic algorithms make a wrong decision. For 
example, anti-forensic attacks can remove traces left by resampling and make the 
forensic algorithms designed for characterizing resampling traces to believe the anti- 
forensically attacked images as “unaltered”. Anti-forensic attacks can also synthe- 
size fake forensic traces associated with a particular target camera model B into an 
images, and make a camera model identification algorithm to believe an image was 
captured by the camera model B, while the image was actually captured by cam- 
era model A. Anti-forensic attacks can also hide forensic traces to obfuscate their 
true origins. Anti-forensic attacks are expected to leave invisible distortion, since an 
visible distortion often flag “fake” for investigators. However, anti-forensic attacks 
do not consider countermeasures of itself, and only examine fooling target forensic 
algorithms. 

Additionally, anti-forensic attacks are often designed to falsify particular foren- 
sic traces. A common methodology is typically used to create anti-forensic attacks 
similar to those described in the previous section. First, a human expert designs a 
parametric model of the target forensic trace that they wish to falsify during the attack. 
Next, an algorithm is created to introduce a synthetic trace into the attacked image 
so that it matches this model. Often this is accomplished by creating a generative 
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noise model which is used to introduce specially designed noise either directly into 
the image or into some transform of the image. Many anti-forensic attacks have been 
developed following this methodology, such as removing traces left by median fil- 
tering, JPEG compression, or resampling, or synthesizing demosaicing information 
of camera models. 

This approach to creating attacks has several important drawbacks. First, it is both 
challenging and time consuming for human experts to create accurate parametric 
models of forensic traces. In some cases it is extremely difficult or even infeasible to 
create models accurate enough to fool state-of-the-art detectors. Furthermore, models 
of one forensic cannot typically be reused to attack a different forensic trace. This 
creates an important scalability challenge for the attacker. 


17.3.5 Anti-Forensic Attacks on Deep Neural Networks 


An even greater challenge arises from the widespread adoption of deep-learning- 
based forensic algorithms such as convolutional neural networks (CNNs). The 
approach described above cannot be used to attack forensic CNNs since they don’t 
utilize an explicit feature set or model of a forensic trace. Anti-forensic attacks tar- 
geting on deep-learning-based forensic algorithms now draw an increasing amount 
of attentions both from academia and industries. 

While deep learning has achieved many state-of-the-art performances on many 
machine learning tasks, including computer vision and multimedia forensics, 
researchers found that non-ideal properties of neural networks can be exploited and 
cause misclassifications of an input image. Several explanations have been posited for 
the non-ideal properties of neural networks that adversarial examples exploit, such 
as imperfections caused by the locally-linear nature of neural networks (Goodfellow 
et al. 2014) or that there is misalignment between the features used by humans and 
those learned by neural networks when performing the same classification task (Ilyas 
et al. 2019). These findings facilitate the development of adversarial example gener- 
ation algorithms. 

Adversarial example generation algorithms operate by adding adversarial pertur- 
bations to an input image. The adversarial perturbations are some noises produced 
by optimizing a certain distance metric, usually L, norm (Goodfellow et al. 2014; 
Madry et al. 2017; Kurakin et al. 2016; Carlini and Wagner 2017). The adversarial 
perturbations aim to push the representation in latent feature space just across the 
decision boundary of the desired class (i.e., targeted attack) or push the data away 
from its true class (i.e., untargeted attack). Forensic researchers found that adversarial 
example generation algorithms can be used as anti-forensic attacks on forensic algo- 
rithms. In 2017, Giiera et al. showed that camera model identification CNNs (Giiera 
et al. 2017) can be fooled by Fast Gradient Sign method (FGSM) (Goodfellow et al. 
2014) and Jacobian-based Saliency Map (JSMA) (Papernot et al. 2016). 
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There are several drawbacks for using adversarial example generation algorithms 
asanti-forensic attacks. To ensure the CNN can be successfully fooled, the adversarial 
perturbations added to the image may be strong and result in fuzzy or distorted output 
images. While using some tricks, researchers could control the visual quality of anti- 
forensic attacked images to a certain level, maintaining high visual quality is still a 
challenge for further study. Another drawback is that adversarial perturbations have to 
be optimized for particular given input image. As a result, it could be time-consuming 
for producing large volume of anti-forensically attacked images. Additionally, the 
adversarial perturbations are not invariant to detection block alignment. 


17.4 Using GANs to Make Anti-Forensic Attacks 


Recently, researchers have begun adapting the GAN framework to create a new class 
of anti-forensic attacks. Though research into these new GAN-based attacks is still 
in its early stages, recent results have shown that these attacks can overcome many 
of the problems associated with both traditional anti-forensic attacks and attacks 
based on adversarial examples. In this section, we will describe at a high level how 
GAN-based anti-forensic attacks are constructed. Additionally, we will discuss the 
advantages of utilizing these types of attacks. 


17.4.1 How GANS Are Used to Construct Anti-Forensic 
Attacks 


GAN-based anti-forensic attacks are designed to fool forensic algorithms built upon 
neural networks. At a high level, they operate by using an adversarially trained 
generator to synthesize a target set of forensic traces in an image. Training occurs 
first, before the attack is launched. Once the anti-forensic generator has been trained, 
an image can be attacked simply by passing it through the generator. The generator 
does not need to be re-trained for each image, however, it must be re-trained if a new 
target class is selected for the attack. 

Researchers have shown that GANs can be adapted to both automate and gen- 
eralize the anti-forensic attack design methodology described in Sect. 17.5 of this 
chapter. The first step in the traditional approach to creating an anti-forensic attack 
involves the creation of a model of the target forensic trace. While it is difficult or 
potentially impossible to build an explicit model of the traces learned by forensic 
neural networks, GANs can exploit the model that has already been learned by the 
forensic neural network. This is done by adversarially training the generator against 
a pre-trained version of the victim classifier C. The victim classifier can either take 
the place of the discriminator in the traditional GAN formulation, or it can be used 
in conjunction with a traditional discriminator. Currently, there is not a clear consen- 
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sus on whether a discriminator is necessary or beneficial when creating GAN-based 
anti-forensic attacks. Additionally, in some scenarios the attacker may not have direct 
access to the victim classifier. If this occurs, the victim classifier cannot be directly 
integrated into the adversarial training process. Instead, alternate means of training 
the generator must be used. These are discussed in Sect. 17.5. 

When the generator is adversarially trained against the victim classifier, it learns 
the model of the target forensic trace implicitly used by the victim classifier. Because 
of this, the attacker does not need to explicitly design a model of the target trace, as 
would typically be done in the traditional approach to creating an anti-forensic attack. 
This dramatically reduces the time required to construct an attack. Furthermore, it 
increases the accuracy of the trace model used by the attack, since it is directly 
matched to the victim classifier. 

Additionally, GAN-based attacks significantly reduce the effort required for the 
second step of the traditional approach to creating an anti-forensic attack, i.e., creating 
a technique to synthesize the target trace. This is because the generator automatically 
learns to synthesize the target trace through adversarial training. While existing GAN- 
based anti-forensic attacks utilize different generator architectures, the generator still 
remains somewhat generic in that it is a deep convolutional neural network. CNNs 
can be quickly and easily implemented using deep learning frameworks such as 
TensorFlow or Pytorch, making it easy for an attacker to experimentally optimize 
the design of their generator. 

Currently, there are no well-established guidelines for designing anti-forensic 
generators. However, research by Zhang et al. has shown that when an upsampling 
component is utilized in a GAN’s generator, it leaves behind “checkerboard” artifacts 
in the synthesized image. Zhang et al. (2019). As a result, it is likely important to 
avoid the use of upsampling in the generator so that visual or statistically detectable 
artifacts that may act as giveaways are not introduced to an attacked image. This is 
reinforced by the fact that many existing attacks utilize fully convolutional generators 
that do not include any pooling or upsampling layers. 

Convolutional generators also possess the useful property that once they are 
trained, they can be used to attack an image of any size. Fully convolutional CNNs 
are able to accept images of any size as an input and produce output images of the 
same size. As a result, these generators can be deployed in realistic scenarios in 
which the size of the image under attack may not be known a prior, or may vary in 
the case of attacking multiple images. They also provide an advantage in that synthe- 
sized forensic traces are distributed throughout the attacked image and do not depend 
on a blocking grid. As a result, the attacker does not require advanced knowledge 
of which image region or block that an investigator will analyze. This is a distinct 
advantage over anti-forensic attacks based upon adversarial examples. 

Research also suggests that a well-constructed generator can be re-trained to 
synthesize different forensic traces than the ones which it was initially designed for. 
For example, the MISLGAN attack was initially designed to falsify an image’s origin 
by synthesizing source camera model traces linked with a different camera model. 
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Recent work has also used this generator architecture to construct a GAN-based 
attack that removes evidence of multiple editing operations. 


17.4.2 Overview of Existing GAN-Based Attacks 


Recently, several GAN-based anti-forensic attacks have been developed to falsify 
information about an image’s source and authenticity. Here we provide a brief 
overview of existing GAN-based attacks. 

In 2018, Kim et al. proposed a GAN-based attack to remove median filtering 
traces from an image (Kim et al. 2017). This GAN was built using a generator and 
a discriminator. The generator was trained to remove the median filtering traces, 
and the discriminator was trained to distinguish between the restored images and 
the unaltered images. The author introduced loss computed from a pre-trained VGG 
network to improve the visual quality of attacked images. This attack was able to 
remove media filtering traces from gray-scale images, and fool many traditional 
forensic algorithms using hand-designed features, and CNN detectors. 

Chen et al. proposed a GAN-based attack in 2018 named MISLGAN to falsify an 
image’s source camera model (Chen et al. 2018). In this work, the authors assumed 
that the attacker has access to the forensic CNN that the investigator will use to 
identify an image’s source camera model. Their attack attempts to modify the traces 
in images associated with a target camera model such that the forensic CNN mis- 
classifies an attacked image as having been captured by the target camera model. 
MISLGAN is consists of three major components: a generator, a discriminator, and 
the pre-trained forensic camera model CNN. The generator was adversarially trained 
against both the discriminator and the camera model CNN for each target camera 
model. The trained generator was shown to be able to falsify fake camera models 
for any color images with high success attack rate (i.e., the percentage of an image 
being classified as the target camera model after being attacked). This attack was 
demonstrated to be effective even when images under attack were captured by cam- 
era models not used during training. Additionally, the attacked images also maintain 
high visual qualities even comparing with the original unaltered image side by side. 

In 2019, the authors of MISLGAN extended this work to create a black box 
attack against forensic source camera model identification CNNs (Chen et al. 2019). 
To accomplish this, the authors integrated a substitute network to approximate the 
camera model identification CNNs under attack. By training MISLGAN against a 
substitute network, this attack can successfully fool state-of-the-art camera model 
CNNs even in black-box scenarios, and outperformed adversarial example generation 
algorithms such as FGSM (Goodfellow et al. 2014). This GAN-based attack has 
been adapted to remove traces left by editing operations and to create transferable 
anti-forensic attacks (Zhao et al. 2021). Additionally, by improving the generator’s 
architecture, Zhao et al. were able to falsify traces left in synthetic images by the 
GAN generation process (Xinwei Zhao and Matthew C. Stamm 2021). This attack 
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was shown to fool existing synthetic image detectors targeted at detecting both GAN- 
generated faces and objects. 

In 2021, Cozzolino et al. proposed a GAN-based attack, SpoC (Cozzolino et al. 
2021), to falsify camera model information of a GAN-synthesized image. SpoC is 
consisted of a generator, a discriminator, and a pre-trained embedder. The embedder 
is used to learn the reference vectors of real images or the images of the target cam- 
era models in the latent spaces. This attack can successfully fool GAN-synthesized 
image detectors or the camera model identification CNNs, and outperformed adver- 
sarial example generation algorithm such as FGSM (Goodfellow et al. 2014) and 
PGD (Madry et al. 2017). The authors also showed that the attack is robust to JPEG 
compression of a certain level. 

In 2021, Wu et al. proposed a GAN-based attack, JRA-GAN, to restore JPEG 
compressed images (Wu et al. 2021). The authors used a GAN to restore the high 
frequency information in the DCT domain, and remove the blocking effect caused 
by JPEG compression to fool JPEG compression detection algorithms. The authors 
showed that the attack can successfully restore gray-scale JPEG compressed images 
and fool traditional JPEG detection algorithms using hand-designed features, and 
CNN detectors. 


17.4.3 Differences Between GAN-Based Anti-Forensic 
Attacks and Adversarial Examples 


While attacks built from GANs and from adversarial examples both use deep learning 
to create anti-forensic countermeasures, it is important to note that the GAN-based 
attacks described in this chapter are fundamentally different from adversarial exam- 
ples 

As discussed in Chap. 16, adversarial examples operate by exploiting non-ideal 
properties of a victim classifier. There are several explanations that have been con- 
sidered by the research community as to why adversarial examples exist and are 
successful. These include “blind spots” caused by non-ideal training or overfitting, 
a mismatch between features used by humans and those learned by neural networks 
when performing classification (Ilyas et al. 2019), and inadvertent effects caused by 
the locally-linear nature of neural networks (Goodfellow et al. 2014). In all of these 
explanations, adversarial examples cause a classifier to misclassify due to unintended 
behavior. 

By contrast, GAN-based anti-forensic attacks do not try to trick the victim clas- 
sifier into misclassifying an image by exploiting its non-ideal properties. Instead, 
they replace the forensic traces present in an image with a synthetic set of traces that 
accurately match the victim classifier’s model of the target trace. In this sense, they 
attempt to synthesize traces that the victim classifier correctly uses to perform tasks 
such as manipulation detection and source identification. This is possible because 
forensic traces, such as those linked to editing operations or an image’s source cam- 
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era, are typically not visible to the human eye. As a result, GANSs are able to create 
new forensic traces in an image without altering its content or visual appearance. 

Additionally, adversarial example attacks are customized to each image under 
attack. Often, this is done through an iterative training or search process that must 
be repeated each time a new image is attacked. It is worth noting that GAN-based 
techniques for generating adversarial examples that have recently been proposed in 
the computer vision and machine learning literature (Poursaeed et al. 2018; Xiao 
et al. 2018). These attacks are distinct from the GAN-based attacks discussed in this 
chapter, specifically because they generate adversarial examples and not synthetic 
versions of features that should accurately be associated with the target class. For 
these attacks, the GAN framework is instead adapted to search for an adversarial 
example that can be made from a particular image in the same manner that iterative 
search algorithms are used. 


17.4.4 Advantages of GAN-Based Anti-Forensic Attacks 


There are several advantages to using GAN-based anti-forensic attacks over both 
traditional human-designed approaches and those based on adversarial examples. 
We briefly summarize these below. Open problems and weaknesses of GAN-based 
adversarial attacks are further discussed in Sect. 17.6. 


e Rapid and automatic attack creation: As discussed earlier, GANs both automate 
and generalized the traditional approach to designing anti-forensic attacks. As a 
result, it is much easier and quicker to create new GAN-based attacks than through 
traditional human-designed approaches. 

e Low visual distortion: In practice, existing GAN-based attacks tend to introduce 
little-to-no visually detectable distortion. While this condition is not guaranteed, 
visual distortion can be effectively controlled during training by carefully balanc- 
ing the weight placed on the generator’s loss term controlling visual quality. This 
is discussed in more detail in Sect. 17.5. By contrast, several adversarial example 
attacks can leave behind visually detectable noise-like artifacts (Giiera et al. 2017). 

e Attack only requires training once: After training, GAN-based attack can be 
launched on any image as long as the target class remains constant. This differs 
from attacks based on adversarial examples, which need to create a custom set of 
modifications for each image, often involving an iterative algorithm. Additionally, 
no image-specific parameters need to be learned, unlike for several traditional 
anti-forensic attacks. 

e Scalable deployment in realistic scenarios: Attack requires only to pass the 
image through a pre-trained convolutional generator. It launches very rapidly and 
can be deployed at scale. Additionally, since the attack is launched 

e Ability to attack images of arbitrary size: Because the attack is launched via 
a convolutional generator, the attack can be applied to images of any size. By 
contrast, attacks based on adversarial examples only produce attacked images of 
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the same size as the input to the victim CNN, which is typically a small patch and 
not a full sized image. 

e Does not require advanced knowledge of the analysis region: One way to adapt 
adversarial attacks to larger images is to break the image into blocks, then attack 
each block separately. This however requires advanced knowledge of the particular 
blocking grid that will be used during forensic analysis. If the blocking grid used 
by the attacker and the detector do not align, or if the investigator analyzes an 
image multiple times with different blocking grids, adversarial example attacks 
are unlikely to be successful. 


17.5 Training Anti-Forensic GANs 


In this section, we give an overview of how to train an anti-forensic generator. We 
begin by discussing the different terms of the loss function used during training. We 
shown how each term of the loss function is formulated in the perfect knowledge 
scenario (i.e., when a white box attack is launched). After this, we show how to 
modify the loss function to train the attack to create black box attacks in limited 
knowledge scenarios, as well transferable attacks in the zero knowledge scenario. 


17.5.1 Overview of Adversarial Training 


During adversarial training, the generator G used in GAN-based attack should be 
incentivized to produce visually convincing anti-forensic attacked images that can 
fool the victim forensic classifier C, as well as a discriminator D if it is used. These 
goals are achieved by properly formulating the generator’s loss function Cç. A typical 
loss function for adversarially training an anti-forensic generator consists of three 
major loss terms: the classification loss Cc, the adversarial loss £4 (if a discriminator 
is used), and perceptual loss £ p, 


Leo = a£p + BCc + La, (17.2) 


where a, 0, y are weights to balance the performance of the attack and the visual 
quality of the attacked images. 

Like all anti-forensic attacks, the generator’s primary goal is to produce output 
images that fool a particular forensic algorithm. This is done by introducing a term 
into the generator’s loss function that we describe as the forensic classification loss, 
or classification loss for short. Classification loss is the key element to guide the 
generator to learn the forensic traces of the target class. Depending on the attacker’s 
access to and knowledge level of the forensic classifier, particular strategies may need 
to be adopted during training We will elaborate on this later in this section. Typically, 
the classification loss is defined as 0 if the output of the generator is classified by the 
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forensic classifier as belonging to the target class ¢ and 1 if the output is classified as 
any other class. 

If the generator is trained to fool a discriminator as well, then the adversarial 
loss provided by the feedback of the discriminator is consolidated into the total loss 
function. Typically, the adversarial loss is 0 if the generated anti-forensically attacked 
images fools the discriminator and 1 if not. 

At the same time, the anti-forensic generator should introduce minimal distor- 
tion into the attacked image. Some amount of distortion is acceptable, since in an 
anti-forensic setting, the unattacked image will not be presented to the investigator. 
However these distortions should not undo or significantly alter any modifications 
that were introduced by the attacker before the image is passed through the anti- 
forensic generator. To control the visual quality of anti-forensically attack images, 
the perceptual loss measures the pixel-wise difference between the image under 
attack and the anti-forensically attacked image produced by the generator. Minimiz- 
ing the perceptual loss during the adversarial training process helps to ensure that 
attacked images produced by the generator contain minimal, visually imperceptible 
distortions. 

When constructing an anti-forensic attack, itis typically advantageous to exploit as 
much knowledge of the algorithm under attack as possible. This holds especially true 
with GAN-based attacks, which directly integrates a forensic classifier into training 
the attack. Since gradient information from a victim forensic classifier is needed 
to compute the classification loss, different strategies must be adopted to train the 
anti-forensic generator depending on the attacker’s knowledge about the forensic 
classifier under attack. As a result, it is necessary to provide more detail about the 
different knowledge levels that an attacker may have of the victim classifier before 
we are able to completely formulate the loss function. We note that the knowledge 
level typically only has an influence on the classification loss. The perceptual loss 
and discriminator loss remain unchanged for all knowledge levels. 


17.5.2 Knowledge Levels of the Victim Classifier 


As mentioned previously, we refer to the forensic classifier under attack as the victim 
classifier C. Depending on the attacker’s knowledge level of the victim classifier, it 
is common to categorize the knowledge scenarios into the perfect knowledge (i.e., 
white box) scenario, the limited knowledge (i.e., black box) scenario, and the zero 
knowledge scenario. 

In the perfect knowledge scenario, the attacker has full access to the victim clas- 
sifier or an exact identical copy of the victim classifier. This is an ideal situation for 
the attacker. The attacker can launch a white box attack in which they train directly 
against the victim classifier that they wish to fool. 

In many other situations, however, the attacker has only partial knowledge or even 
zero knowledge of the victim classifier. Partial knowledge scenarios may include a 
variety of specific settings in real life. A common aspect is that the attacker does not 
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have full access or control over the victim classifier. However, the attacker is allowed 
to probe the victim classifier as a black box. For example, they can query the victim 
classifier using an input, then observe the value of its output. Because of this, the 
partial knowledge scenario is also referred to as the black box scenario. This scenario 
is a more challenging, but potentially more realistic situation for the attacker. For 
example, a piece of forensic software may or online service may be encrypted to 
reduce the risk of attack. As a result, its users would only be able to provide input 
images and observe the results that the system output. 

An even more challenging situation is when the attacker has zero knowledge of 
the victim classifier. Specifically, the attacker not only has no access to the victim 
classifier, but also the attacker is not allowed to observe the victim classifier by any 
means. This is also a realistic situation for the attacker. For example, an investigator 
may develop private forensic algorithms and keep this information classified for 
security purposes. In this case, we assume that the attacker knows what the victim 
classifier’s goal is (i.e., manipulation detection, source camera model identification, 
etc.). A broader concept of zero knowledge may include the attacker not knowing if 
a victim classifier exists, or what the specific goals of the victim classifier are (i.e., 
the attacker will not know anything about an image will be analyzed). Currently, 
however, this is beyond the scope of existing antiforensics research. 

In both the black box and the zero knowledge scenarios, the attacker cannot use 
the victim classifier C to directly train the attack. The attacker needs to modify the 
classification loss in (17.2) such that a new classification loss is provided by other 
classifiers that can accessed during attack training. We note that the perceptual loss 
and the adversarial loss usually remain the same for all knowledge scenarios, or may 
need trivial adjustment based on specific cases. The main change, however, is the 
formulation of the classification loss. 


17.5.3 White Box Attacks 


We start by discussing the formulation of each loss term in the perfect knowledge 
scenario, i.e., when creating a white box attack. Each term in the generator’s loss is 
defined below. The perceptual loss and the adversarial loss will remain the same for 
all subsequent knowledge scenarios. 


e Perceptual Loss: The perceptual loss Lp can be formulated in many different 
ways. The mean squared error (Lz loss) or mean absolute difference (Li loss) 
between an image before and after attack are the most commonly used formula- 
tions, i.e., 

Lp = H- GODI, (17.3) 


where N is the number of pixels in the reference image / and anti-forensically 
attacked image G(J), || - || p denotes the p norm and p equals to 1 or 2. Empiri- 
cally, the L loss usually yields better visual quality for GAN-based anti-forensic 
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attacks. This is potentially because the Lı loss penalizes small pixel differences 
comparatively more than the L2 loss, which puts greater emphasis on penalizing 
large pixel differences. 

e Adversarial Loss: The adversarial loss £4 is used to ensure the anti-forensically 
attacked image can fool the discriminator D. Ideally, the discriminator’s output 
of the anti-forensically attacked images is 1 when the attacked images fool the 
discriminator, 0 otherwise. Therefore, the adversarial loss is typically formulated 
as the sigmoid cross-entropy between the discriminator’s output of the generated 
anti-forensically attacked image and 1. 


La = log(1 — D(G(W))), (17.4) 


As mentioned previously, a separate discriminator is not always used when creating 
a GAN-based anti-forensic attack. If this is the case, then the adversarial loss term 
can be discarded. 

e Classification Loss: In the perfect knowledge scenario, since the victim classifier 
C is fully accessible by the attacker, the victim classifier can be directly used to 
calculate the forensic classification loss Cc. Typically this is done by computing 
the softmax cross-entropy between the classifier’s output and the target class t, 
ie., 


Le = — log(C(G()),) (17.5) 


where C(G(J)); is the softmax output of victim classifier at location t. 


17.5.4 Black Box Scenario 


The perfect knowledge scenario is often used to evaluate the baseline performance of 
an anti-forensic attack. However, it is frequently unrealistic to assume that an attacker 
would have full access to the victim classifier. More commonly, the victim classifier 
may be presented as a black box to the attacker. Specifically, the attacker does not 
have full control over the victim classifier, nor do they have enough knowledge of 
its architecture to construct a perfect replica. As a result, the victim classifier can 
not be used to produce the classification loss as formulated in the perfect knowledge 
scenario. The victim classifier can, however, be probed by the attacker as a black 
box, then the output can be observed. This is done by providing it input images, then 
recording how the victim classifier classifies those images. The attacker can use this 
information and modify the classification loss to exploit it. 

One approach is to build a substitute network C, to mimic the behavior of the 
victim classifier C. The substitute network is a forensic classifier built and fully 
controlled by the attacker. Ideally, if the substitute network is trained properly, it 
perfectly mimics the decisions made by the victim classifier. The substitute network 
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can then be used to adversarially train the generator instead of the victim classifier 
by reformulating the classification loss as 


Le = —log(C,(G(1));) (17.6) 


where C, (G ((T)), the softmax output of the substitute network C, at location t. 

When training a substitute network, it is important that it is trained to reproduce 
the forensic decisions produced by the victim classifier even if these decisions are 
incorrect. By doing this, the substitute network can approximate the latent space in 
which the victim classifier encodes forensic traces. For an attack to be successful, 
it is important to match this space as accurately as possible. Additionally, there are 
no strict rules guiding the design of the substitute network’s architecture. Research 
suggests that as long as the substitute network is deep and expressive enough to repro- 
duce the victim classifier’s output, a successful black box attack can be launched. 
For example, Chen et al. demonstrated that successful black box attacks against 
source camera model identification algorithms can be trained using different substi- 
tute networks can built primarily from residual blocks and dense blocks (Chen et al. 
2019). 


17.5.5 Zero Knowledge 


In the zero knowledge scenario, the attacker has no access to the victim classifier C. 
Furthermore, the attacker cannot observe or probe the victim classifier like in black 
box scenario. As a result, the substitute network approach described in the black box 
scenario does not translate to the zero knowledge scenario. In this circumstance, the 
synthetic forensic traces created by the attack must be transferable enough to fool 
the victim classifier. 

Creating transferable attacks is challenging and remains an open research area. 
One recently proposed method of achieving attack transferability is to adversarially 
train the anti-forensic generator against an ensemble of forensic classifiers. Each 
forensic classifier Cm in the ensemble is built and pre-trained for a forensic classifi- 
cation by the attacker and acts as a surrogate for the victim classifier. For example, if 
the attacker’s goal is to fool a manipulation detection CNN, then each surrogate clas- 
sifier in the ensemble should be trained to perform manipulation detection as well. 
These surrogate classifiers, however, should be as diverse as possible. Diversity can 
be achieved by varying the architecture of the surrogate classifiers, the classes that 
they distinguish between, or other factors. 

Together, these surrogate classifiers can be used to replace the victim classifier 
and compute a single classification loss in the similar fashion as in the white box 
scenario. The final classification loss is formulated as a weighted sum of individual 
classification losses pertaining to each surrogate classifier in the ensemble, such that 
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M 
tes bnLo, (17.7) 


m=1 


where M is the number of surrogate classifiers in the ensemble, Cc, corresponds to 
individual classification loss of the mth surrogate classifier, Øm corresponds to the 
weight of mth individual classification loss. 

While there exists no strict mathematical justification for why this approach can 
achieve robustness, the intuition is that each surrogate classifier in the ensemble learns 
to partition the forensic feature space into separate regions for the target and other 
classes. By defining the classification loss in this fashion, the generator is trained to 
synthesize forensic features that lie in the intersection of these regions. If a diverse 
set of classifiers are used to form the ensemble, this intersection will likely lie inside 
the decision region that other classifiers (such as the victim classifier) associate with 
the target class. 


17.6 Known Problems with GAN-Based Attacks & Future 
Directions 


While GAN-based anti-forensic attacks pose a serious threat to forensic algorithms, 
this research is still in its very early stages. There are several known shortcomings 
of these new attacks, as well as open research questions. 

One important issue facing GAN-based anti-forensic attacks is transferability. 
While these attacks are strongest in the white box scenario where the attacker has 
full access to the victim classifier, this scenario is the least realistic. More likely, the 
attacker knows only the general goal of the victim classifier (i.e., manipulation detec- 
tion, source identification, etc.) and possibly the set of classes this classifier is trained 
to differentiate between. This corresponds to the zero knowledge scenario described 
in Sect. 17.5.5. In this case, the attacker must rely entirely on attack transferability 
to launch a successful attack. 

Recent research has shown, however, that achieving attack transferability against 
forensic classifiers is a difficult task (Barni et al. 2019; Zhao and Stamm 2020). For 
example, small implementation differences between two forensic CNNs with the 
same architecture can prevent a GAN-based white box attack trained against one 
CNN from transferring to the other nearly identical CNN (Zhao and Stamm 2020). 
These differences can include the data used to train each CNN or how each CNN’s 
classes are defined, i.e., distinguish unmanipulated versus manipulated (binary) or 
unmanipulated versus each possible manipulation (multi-class). For GAN-based 
attacks to work in realistic scenarios, new techniques must be created for achiev- 
ing transferrable attacks. While the ensemble-based strategy training discussed in 
Sect. 17.5.5 is a new way to create transferrable attacks, there is still much work to 
be done. Additionally, it is possible that there are theoretical limits on attack trans- 
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ferability. As of yet, these limits are unknown, but future research on this topic may 
help reveal them. 

Another important topic that research has not yet addressed is the types of foren- 
sic algorithms that GAN-based attacks are able to attack. Currently, anti-forensic 
attacks based on both GANSs and adversarial examples are targeted against forensic 
classifiers. However, many forensic problems are more sophisticated than simple 
classification. For example, state-of-the-art techniques to detect locally falsified or 
spliced content are use sophisticated approaches built upon Siamese networks, not 
simple classifiers (Huh et al. 2018; Mayer and Stamm 2020, ?). Siamese networks 
are able to compare forensic traces in two local regions of an image and produce a 
measure of how similar or different these traces are, rather than simply providing a 
class decision. Similarly, state-of-the-art algorithms built to localize fake or manipu- 
lated content also utilize Siamese networks during training or localization (Huh et al. 
2018; Mayer and Stamm 2020, ?; Cozzolino and Verdoliva 2018). Attacking these 
forensic algorithms is much more challenging than attacking comparatively simple 
CNN-based classifiers. It remains to be seen if and how GAN-based anti-forensic 
attacks can be constructed to fool these algorithms. 

At the other end of the spectrum, little-to-no work currently exists aimed specifi- 
cally at detecting or defending against GAN-based anti-forensic attacks. Past research 
has shown that traditional anti-forensic attacks often leave behind their own forensi- 
cally detectable traces (Stamm et al. 2013; Piva 2013; Milani et al. 2012; Barni et al. 
2018). Currently, it is unknown if these new GAN-based attacks leave behind their 
own traces, or what these traces may be. Research into adversarial robustness may 
potentially provide some protection against GAN-based anti-forensic attacks (Wang 
et al. 2020; Hinton et al. 2015). However, because GAN-based attacks operate differ- 
ently than adversarial examples, it is unclear if these techniques will be successful. 
New techniques may need to be constructed to allow forensic algorithms to correctly 
operate in the presence of a GAN-based anti-forensic attack. Clearly, much more 
research is needed to provide protection against these emerging threats. 
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